Transform Your Codebase into Comprehensive Documentation with Markdown
Introduction
Welcome to the age of AI. The world is moving at lightning speed towards artificial intelligence, and programmers have an array of built-in tools in code editors like Zed, VSCode, and Cursor. These editors have the capability to analyze large codebases and assist in resolving issues or creating features.
I've tested many of these editors, but sometimes even including the codebase in chat doesn't provide the full picture, which means the results often fall short or lack the quality the repository requires. The worst-case scenario arises when an AI chat starts producing circular problems: solving one problem introduces a new issue, and fixing that issue reintroduces the first problem.
The core challenge here is the limited access to the whole codebase due to constraints on the number of files or file sizes that AI tools can process.
Standalone AI assistants like Claude, Perplexity, and ChatGPT have come a long way since their inception. ChatGPT now allows attaching files in chat, but at the time of writing it doesn't support submitting zipped folders or entire repositories, so it cannot consider your whole codebase. The same limitation exists for Claude and Perplexity. It would be incredibly beneficial if we could give these AI tools our code in a condensed form: something that isn't spread across hundreds of files and is still readable by the AI.
Solution: A Ruby Script to Convert a Codebase into Markdown
Why Ruby?
The first question that comes to mind is: why Ruby?
Simply put, I like Ruby and work in it. There is no deeper reason than that.
What Does This Ruby Script Do?
You provide the root folder path of your codebase to the script, and it creates Markdown files, each capped at about 100KB. These files contain a project structure tree followed by the content of each file in a code block. Files and folders listed in your .gitignore or specified in the script are excluded.
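To make the "project structure tree" concrete, here is a minimal sketch of how a list of relative paths can be folded into a nested hash and rendered as an indented Markdown list. The helper names build_tree and render_tree and the sample paths are my own illustrative assumptions; the script itself does the equivalent work inline.

```ruby
# Fold relative paths into a nested hash: directories map to hashes, files to nil.
# (build_tree is a hypothetical helper, not part of the script.)
def build_tree(paths)
  paths.each_with_object({}) do |path, tree|
    parts = path.split('/')
    file = parts.pop
    dir = parts.reduce(tree) { |node, part| node[part] ||= {} }
    dir[file] = nil
  end
end

# Render the nested hash as a Markdown bullet list, two spaces per level.
def render_tree(tree, prefix = '')
  tree.map do |name, children|
    line = "#{prefix}- #{name}\n"
    children.is_a?(Hash) ? line + render_tree(children, prefix + '  ') : line
  end.join
end

# Sample paths are made up for illustration.
sample_paths = ['Gemfile', 'app/models/user.rb', 'app/models/post.rb', 'config/routes.rb']
puts render_tree(build_tree(sample_paths))
```

Each directory becomes a bullet with its children indented beneath it, which is the shape the "Project Structure" section of the generated file takes.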
Why Markdown and 100KB Files?
Markdown: Markdown is a lightweight markup language that works well for documentation. It’s text-based, making it easy to read and compatible with most personal knowledge management tools like Notion or Obsidian.
File Size Limitation: Some AI tools, especially Claude, do not read files larger than 100KB. Therefore, the script enforces this limit. You can adjust the limit by changing the CHUNK_SIZE constant at the top of the script.
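One caveat worth knowing: the script slices the output by character count (String#length), while upload limits are usually measured in bytes, so text with many multi-byte characters can end up slightly over the cap. The following is a minimal sketch of a byte-aware alternative that also breaks only at line boundaries. The chunk_by_bytes helper is hypothetical and not part of the script.

```ruby
# Split text into chunks whose UTF-8 byte size stays within max_bytes,
# breaking only at line boundaries. A single line longer than max_bytes
# is emitted as its own oversized chunk rather than being cut mid-line.
# (chunk_by_bytes is a hypothetical helper, not part of the script.)
def chunk_by_bytes(text, max_bytes)
  chunks = []
  current = +''
  text.each_line do |line|
    if !current.empty? && current.bytesize + line.bytesize > max_bytes
      chunks << current
      current = +''
    end
    current << line
  end
  chunks << current unless current.empty?
  chunks
end

sample = "héllo wörld\n" * 5 # each line is 14 bytes but only 12 characters
chunks = chunk_by_bytes(sample, 30)
chunks.each_with_index { |chunk, i| puts "part #{i + 1}: #{chunk.bytesize} bytes" }
```

Breaking at line boundaries also reduces the chance of a chunk ending in the middle of a line of code, although a fenced block can still span two files.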
The Ruby Script: ruby_to_md.rb
Below is the Ruby script that converts your codebase into Markdown files:
https://gist.github.com/sulmanweb/ee1541b1739b06db6695370cbc8a480d
require 'fileutils'
require 'digest'
ALWAYS_IGNORE = ['.git', 'tmp', 'log', '.ruby-lsp', '.github', '.devcontainer', 'storage', '.annotaterb.yml', 'public', '.cursorrules'].freeze
IGNORED_EXTENSIONS = %w[.jpg .jpeg .png .gif .bmp .svg .webp .ico .pdf .tiff .raw .keep .gitkeep .sample .staging].freeze
MAX_FILE_SIZE = 1_000_000 # 1MB
CHUNK_SIZE = 100_000 # 100KB
# Read ignore patterns from the project's .gitignore, if present.
def read_gitignore(directory_path)
  gitignore_path = File.join(directory_path, '.gitignore')
  return [] unless File.exist?(gitignore_path)

  File.readlines(gitignore_path).map(&:chomp).reject(&:empty?)
end
# Decide whether a path should be skipped, based on the built-in lists and the .gitignore patterns.
def ignored?(path, base_path, ignore_patterns)
  relative_path = path.sub("#{base_path}/", '')
  return true if ALWAYS_IGNORE.any? { |dir| relative_path.start_with?(dir + '/') || relative_path == dir }
  return true if IGNORED_EXTENSIONS.include?(File.extname(path).downcase) || File.basename(path) == '.keep'

  ignore_patterns.any? do |pattern|
    File.fnmatch?(pattern, relative_path, File::FNM_PATHNAME | File::FNM_DOTMATCH) ||
      File.fnmatch?(File.join('**', pattern), relative_path, File::FNM_PATHNAME | File::FNM_DOTMATCH)
  end
end
# Wrap one file's content in a fenced code block under a heading with its name.
def convert_to_markdown(file_path)
  extension = File.extname(file_path).downcase[1..]
  format = extension.nil? || extension.empty? ? 'text' : extension
  begin
    content = File.read(file_path, encoding: 'UTF-8')
    "## #{File.basename(file_path)}\n\n```#{format}\n#{content.strip}\n```\n\n"
  rescue StandardError => e
    "## #{File.basename(file_path)}\n\n[File content not displayed: #{e.message}]\n\n"
  end
end
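# Render the nested file tree as an indented Markdown list.
def generate_tree_markdown(tree, prefix = '')
  result = ''
  tree.each do |key, value|
    result += "#{prefix}- #{key}\n"
    # Two spaces per level so Markdown renders children as nested list items.
    result += generate_tree_markdown(value, prefix + '  ') if value.is_a?(Hash)
  end
  result
end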
# Split the combined Markdown into CHUNK_SIZE pieces and write each one to its own numbered file.
# Note: CHUNK_SIZE is applied to characters (String#length), not bytes.
def write_chunked_output(output_file, content)
  base_name = File.basename(output_file, '.*')
  extension = File.extname(output_file)
  dir_name = File.dirname(output_file)
  chunk_index = 1
  offset = 0
  while offset < content.length
    chunk = content[offset, CHUNK_SIZE]
    chunk_file = File.join(dir_name, "#{base_name}_part#{chunk_index}#{extension}")
    File.open(chunk_file, 'w:UTF-8') do |file|
      # Front matter tells the reader (or the AI) where this part sits in the sequence.
      file.write("---\n")
      file.write("chunk: #{chunk_index}\n")
      file.write("total_chunks: #{(content.length.to_f / CHUNK_SIZE).ceil}\n")
      file.write("---\n\n")
      file.write(chunk)
    end
    puts "Markdown file created: #{chunk_file}"
    offset += CHUNK_SIZE
    chunk_index += 1
  end
end
# Walk the project, build the structure tree, and collect each file's Markdown.
def process_directory(directory_path, output_file)
  # Normalize the path so a trailing slash or a relative path doesn't break the prefix stripping below.
  directory_path = File.expand_path(directory_path)
  ignore_patterns = read_gitignore(directory_path)
  markdown_content = "---\nencoding: utf-8\n---\n\n# Project Structure\n\n"
  file_contents = []
  file_tree = {}
  Dir.glob("#{directory_path}/**/*", File::FNM_DOTMATCH).each do |file_path|
    next if File.directory?(file_path)
    next if ['.', '..'].include?(File.basename(file_path))
    next if ignored?(file_path, directory_path, ignore_patterns)
    next if File.size(file_path) > MAX_FILE_SIZE

    # Insert the file into the nested tree: directories are hashes, files are nil leaves.
    relative_path = file_path.sub("#{directory_path}/", '')
    parts = relative_path.split('/')
    current = file_tree
    parts.each_with_index do |part, index|
      if index == parts.size - 1
        current[part] = nil
      else
        current[part] ||= {}
        current = current[part]
      end
    end
    file_contents << convert_to_markdown(file_path)
  end
  markdown_content += generate_tree_markdown(file_tree)
  markdown_content += "\n# File Contents\n\n"
  markdown_content += file_contents.join("\n")
  write_chunked_output(output_file, markdown_content)
end
if ARGV.length != 2
  puts "Usage: ruby ruby_to_md.rb <input_directory> <output_file>"
  exit 1
end

input_directory = ARGV[0]
output_file = ARGV[1]
process_directory(input_directory, output_file)
Script Usage
The script requires two arguments:
Path to Codebase: The root directory of your project.
Output File Name: The base name for the generated Markdown files.
You can run the script, for example, as follows:
ruby ruby_to_md.rb ~/st/gradwinner gradwinner.md
This will create files like:
gradwinner_part1.md
gradwinner_part2.md
gradwinner_part3.md
and so on.
Note: Files and folders listed in .gitignore will be ignored in the resulting documentation files.
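The script passes each .gitignore line straight to File.fnmatch?, so comment lines and directory-style entries such as node_modules/ get no special handling. If your .gitignore relies on those, a stricter matcher may help. Below is a minimal sketch that covers only a subset of the gitignore rules (no negation, and a leading slash is not treated as root-anchored). The helpers parse_gitignore and gitignore_match? are hypothetical names, not part of the script.

```ruby
# Parse .gitignore text into usable patterns: drop blanks, comments, and
# negations (unsupported in this sketch), and trim leading/trailing slashes.
# (Both helpers here are hypothetical, not part of the script.)
def parse_gitignore(text)
  text.each_line.map(&:strip)
      .reject { |line| line.empty? || line.start_with?('#', '!') }
      .map { |line| line.sub(%r{\A/}, '').sub(%r{/\z}, '') }
end

# True if a pattern matches the path itself or any of its parent directories.
def gitignore_match?(relative_path, patterns)
  parts = relative_path.split('/')
  # For "a/b/c.rb" the candidates are "a", "a/b", and "a/b/c.rb".
  candidates = (1..parts.size).map { |n| parts.first(n).join('/') }
  flags = File::FNM_PATHNAME | File::FNM_DOTMATCH
  patterns.any? do |pattern|
    candidates.any? do |candidate|
      File.fnmatch?(pattern, candidate, flags) ||
        File.fnmatch?(File.join('**', pattern), candidate, flags)
    end
  end
end

patterns = parse_gitignore("# dependencies\nnode_modules/\n/coverage\n*.log\n")
puts gitignore_match?('node_modules/lodash/index.js', patterns)
puts gitignore_match?('app/models/user.rb', patterns)
```

Checking every parent directory is what lets an entry like node_modules/ exclude everything beneath it.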
Key Script Features
ALWAYS_IGNORE: This constant lists folders and files that are ignored in addition to those specified in .gitignore.
IGNORED_EXTENSIONS: This constant lists the file extensions (e.g., images) that should not be included in the documentation.
CHUNK_SIZE: You can modify this constant to increase or decrease the amount of data in each Markdown file.
MAX_FILE_SIZE: Files larger than this size will be ignored to prevent overwhelming the documentation.
Note: Many chat tools have a limit of 25 files that can be uploaded at once, so you may need to adjust the script according to your requirements.
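If the file-count limit is the constraint you hit first, the arithmetic for picking a CHUNK_SIZE is simple. Here is a minimal sketch; the chunk_size_for helper and the total length are made-up values for illustration.

```ruby
# Smallest chunk size (in characters, as the script counts them) that keeps
# the number of output files at or below max_files.
# (chunk_size_for is a hypothetical helper, not part of the script.)
def chunk_size_for(total_length, max_files)
  (total_length.to_f / max_files).ceil
end

total = 3_200_000 # hypothetical combined length of the Markdown output
size = chunk_size_for(total, 25)
puts "CHUNK_SIZE needed for 25 files: #{size}"
puts "Files produced at 100_000: #{(total.to_f / 100_000).ceil}"
```

In this example the required chunk size is above 100KB, so the two limits conflict. In that case, adding more entries to ALWAYS_IGNORE or lowering MAX_FILE_SIZE to shrink the output is likely the better fix than raising CHUNK_SIZE.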
Conclusion
This Ruby script helps you convert a code repository into documentation in Markdown format, providing a structured overview and content breakdown. It’s an ideal solution when working with AI tools that have file size or file number limitations, making your codebase more accessible to them in a condensed and readable form.
If you have feedback or improvements to suggest, feel free to contribute!
Happy Coding!