     ____              ____  __       __
    / _  |__ ___  ___ |__  |/ /___ __/ /_
   / , _/ -_) _ \/ _ \/ __// __/\ \ / __/
  /_/|_|\__/ ,__/\___/____/\__//_\_\\__/
          /_/    for llms & text-mining

Repo2txt: Dump Your Repo/Directory into a Single Text File

Effortlessly consolidate all files within a repository (e.g., GitHub) or any directory structure into a single, easily searchable text file. Ideal for text mining, LLM fine-tuning, embedding generation, and more.

Key Features

  • No Dependencies: Pure Python in a single file, with no external dependencies.
  • Multithreaded: Uses multiple threads to speed up file I/O.
  • Binary File Support: Optionally include binary files (encoded images, sounds, executables...) alongside text.
  • Gitignore Integration: Exclude files and patterns listed in the target directory's .gitignore.
  • Human/LLM Friendly Output: Generates structured, human-readable output that can be used directly or tokenized to train and fine-tune models.
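
As a rough illustration of the multithreaded, single-file idea, here is a minimal sketch that reads every file under a directory with a thread pool and joins the contents into one string. It is not the code in main.py: the function names `read_text` and `dump_directory`, the `--- path ---` section header, and the default of 8 workers are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_text(path: Path) -> tuple[Path, str]:
    """Read one file as UTF-8, replacing bytes that cannot be decoded."""
    return path, path.read_text(encoding="utf-8", errors="replace")


def dump_directory(root: str, max_workers: int = 8) -> str:
    """Concatenate every file under root, one labeled section per file.

    Hypothetical helper: the section header format is an assumption,
    not the format produced by repo2txt itself.
    """
    root_path = Path(root)
    files = sorted(p for p in root_path.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the dump is deterministic.
        results = list(pool.map(read_text, files))
    sections = [
        f"--- {path.relative_to(root_path).as_posix()} ---\n{text}"
        for path, text in results
    ]
    return "\n".join(sections)
```

Threads help here because reading files is I/O-bound, so the work overlaps even under Python's global interpreter lock.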

Use Cases

  • LLM Fine-tuning Data Preparation: Create large text datasets for training language models.
  • Text Mining & Analysis: Extract insights from codebases, documentation, and other textual sources.
  • Embedding Generation: Generate text representations for tasks like semantic search and similarity comparison, useful when building RAG pipelines.
  • Repository Backups: Create compact, searchable backups of your code projects.
  • Data Versioning: Track changes in code and content over time with a single file to diff (or not).

Installation

  1. Clone this Repository:
    git clone https://github.com/your_username/repo2txt.git
    cd repo2txt
  2. Done

Usage 📖

Directly execute the main.py script from within the cloned repository:

python main.py -d /path/to/your/repository/to/dump [-t] [-e] [-b] [-g] [-i "*.lock,*.md"] [-o output.txt]

Options:

  • -d, --directory: (Required) The path to the directory you want to dump.
  • -t, --tree: Generate the dump tree only (no file contents, false by default).
  • -e, --embed: Embed the tree at the beginning of the output file (true by default).
  • -b, --binary: Include binary files in the dump (disabled by default).
  • -g, --gitignore: Use the .gitignore file to exclude files (enabled by default).
  • -i, --ignore: Specify additional comma-separated patterns to ignore.
  • -o, --output: Specify the output file name (default is based on directory name).
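
To make the comma-separated value of -i concrete, the sketch below splits a string such as "*.lock,*.md" into patterns and tests relative paths against them with the standard `fnmatch` module. This is an assumption about how such matching could work, not a description of main.py; `parse_patterns` and `is_ignored` are hypothetical names, and shell-style globbing does not cover full .gitignore semantics (negation, anchoring, directory-only rules).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_patterns(raw: str) -> list[str]:
    """Split a comma-separated option value into non-empty patterns."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Return True if the path or its file name matches any pattern.

    Hypothetical helper: matches case-sensitively on every platform.
    """
    name = PurePosixPath(rel_path).name
    return any(
        fnmatchcase(rel_path, pattern) or fnmatchcase(name, pattern)
        for pattern in patterns
    )


patterns = parse_patterns("*.lock,*.md")
print(is_ignored("docs/README.md", patterns))  # matched by *.md
print(is_ignored("src/main.py", patterns))     # no pattern matches
```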

Examples 💡

Dumping All Files (Including Binaries):

python main.py -d /path/to/your/github/repository -b -o my_repo_dump.txt

Generating Tree Structure Only:

python main.py -d /path/to/your/github/repository -t -o my_repo_tree.txt

Output Sample (Tree Only):

+----------------------------------------+
| Dump tree for directory: ../collector/ |
+----------------------------------------+
├── .env.test
├── README.md
├── dbs
│   ├── Dockerfile.dbs
│   └── start-test.bash
├── forwarder
│   ├── cargo.toml
│   ├── main.rs
│   ├── messages.rs
│   └── server.rs
├── main.py
├── presets
│   └── markets.yml
└── tests
    ├── fowarder.rs
    └── server.rs
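
A tree listing of this kind can be produced with a short recursive walk. The sketch below is one way to do it, not the project's own implementation: `render_tree` is a hypothetical name, entries are sorted by name, and the boxed header from the sample is left out.

```python
from pathlib import Path


def render_tree(directory: Path, prefix: str = "") -> list[str]:
    """Return the lines of a box-drawing tree for everything under directory."""
    entries = sorted(directory.iterdir(), key=lambda p: p.name)
    lines = []
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        # The last entry at each level closes the branch with a corner.
        lines.append(f"{prefix}{'└── ' if is_last else '├── '}{entry.name}")
        if entry.is_dir():
            # Children of a non-last entry keep the vertical guide line.
            child_prefix = prefix + ("    " if is_last else "│   ")
            lines.extend(render_tree(entry, child_prefix))
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(Path("."))))
```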

Disclaimers ⚠️

  • Binary Data: Including binary files (images, videos, executables) can significantly increase the output file size and introduce noise. Use the -b option with caution.
  • Ignore Patterns: Utilize .gitignore and the -i option to exclude unnecessary files like logs, caches, and artifacts, which can make the output more manageable and relevant.
  • Output Size: Be mindful of the potential size of the output file, especially when including binary data or large repositories.
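
Before dumping a large repository, it can help to estimate how much data would end up in the output. The sketch below sums file sizes and, unless told otherwise, skips files that look binary. Both helpers are hypothetical and not part of repo2txt: the null-byte check is a common heuristic rather than the tool's actual detection logic, and the total is only a rough lower bound because headers and any binary encoding add overhead.

```python
from pathlib import Path


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic: treat a file as binary if its first bytes contain a null byte."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def estimate_dump_size(root: Path, include_binary: bool = False) -> int:
    """Sum the sizes in bytes of the files that would be included in a dump."""
    total = 0
    for path in root.rglob("*"):
        if path.is_file() and (include_binary or not looks_binary(path)):
            total += path.stat().st_size
    return total


if __name__ == "__main__":
    size = estimate_dump_size(Path("."))
    print(f"Approximate text payload: {size / 1024:.1f} KiB")
```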

Contributing

We welcome contributions! The project can be improved in many ways:

  • add support for fetching remote repositories (or even FTP sources) and dumping them in seconds
  • improve performance through better I/O and threading
  • add complex pattern support for fine-grained file ignoring
  • add ignore preset files for language-specific use cases (so far the tool has mostly been used with Python repositories)
  • and more...

Feel free to fork the repository, make your changes, and submit a pull request ❤️

License

This project is licensed under the MIT License.

Contributors

pde-rent, sanchexas
