Repo2Txt is a Python script that lets you interactively traverse and analyse the contents of a GitHub repository or a local folder. It extracts the structure and contents of the selected files and folders and saves them to a text file.
- Traverses and analyses both local directories and GitHub repositories.
- Saves the analysis, including repository structure and file contents, to a text file.
- Skips binary files, handles different encodings for text files, and excludes junk directories (e.g., __pycache__, .git, .hg, .svn, .idea, .vscode, node_modules).
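
The filtering described in the list above could look roughly like the following for a local folder. This is a minimal sketch, not the script's actual code: the function names, the NUL-byte heuristic for detecting binary files, and the encoding fallback order are all assumptions made for illustration.

```python
import os

# Directory names taken from the feature list above.
JUNK_DIRS = {"__pycache__", ".git", ".hg", ".svn", ".idea", ".vscode", "node_modules"}


def looks_binary(path, sample_size=1024):
    """Heuristic (an assumption): a NUL byte near the start means binary."""
    with open(path, "rb") as fh:
        return b"\0" in fh.read(sample_size)


def read_text(path, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; return None if none can decode the file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                return fh.read()
        except UnicodeDecodeError:
            continue
    return None


def iter_text_files(root):
    """Yield (relative_path, text) for readable text files under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune junk directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in JUNK_DIRS)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if looks_binary(path):
                continue
            text = read_text(path)
            if text is not None:
                yield os.path.relpath(path, root), text
```

Pruning `dirnames` in place is what keeps `os.walk` out of large directories such as node_modules, rather than walking them and discarding the results afterwards.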
Additional Features/Improvements in This Repo (not present in Doriandarko/RepoToTextForLLMs):
- Interactively select specific branches, folders, and files for analysis, with an option to include or exclude sub-folders.
- Count tokens for selected files and include token statistics in the analysis for easier prompt pruning.
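
Token counting of the kind mentioned above could be sketched as follows. The `cl100k_base` encoding is an assumed choice (the script may use a different one), and the `encode` parameter is a hypothetical hook added here so that the helper can be exercised without tiktoken.

```python
def count_tokens(text, encode=None):
    """Return the number of tokens in text.

    encode is a hypothetical hook: any callable mapping str to a sequence
    of tokens. When omitted, tiktoken's cl100k_base encoding is used
    (an assumed choice, not necessarily what the script does).
    """
    if encode is None:
        import tiktoken  # imported lazily so a custom encoder needs no install

        # encode_ordinary treats special-token text as plain text.
        encode = tiktoken.get_encoding("cl100k_base").encode_ordinary
    return len(encode(text))


def token_stats(files, encode=None):
    """Map each file name to its token count and add a grand total."""
    counts = {name: count_tokens(text, encode) for name, text in files.items()}
    return {"per_file": counts, "total": sum(counts.values())}
```

Per-file counts make it easy to see which files dominate the total and are the best candidates to drop when pruning a prompt.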
- Python 3.6 or later
- PyGithub library: install it using pip install PyGithub
- tqdm library: install it using pip install tqdm
- tiktoken library: install it using pip install tiktoken
- GitHub Personal Access Token (PAT) for accessing private repositories
- Clone the repository or download the script.
    git clone https://github.com/your-username/repo2txt.git
- Navigate to the directory containing the script.
    cd repo2txt
- Install the required Python packages.
    pip install PyGithub tqdm tiktoken
- Ensure you have a GitHub Personal Access Token (PAT). Set it as an environment variable named GITHUB_TOKEN.
    export GITHUB_TOKEN='your_github_token'
- Run the script.
    python repo2txt.py
- Follow the prompts to enter the GitHub repository URL or the path to a local folder.
- Interactively select the folders and files you wish to analyse. You can choose to include or exclude sub-folders.
- To count tokens in the files, pass the --count-tokens flag when running the script.
    python repo2txt.py --count-tokens
- The script saves the analysis, including the repository structure, file contents, and token statistics, to a text file in the current directory.
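
Since the prompt accepts either a GitHub URL or a local path, the script has to tell the two apart and pick up the token from the environment. One way this might be done is sketched below; `parse_source` and `github_token` are hypothetical helper names, not functions from the script.

```python
import os
from urllib.parse import urlparse


def parse_source(user_input):
    """Return ("github", "owner/repo") for a GitHub URL, else ("local", path)."""
    text = user_input.strip()
    parsed = urlparse(text)
    is_http = parsed.scheme in ("http", "https")
    if is_http and parsed.netloc.lower() in ("github.com", "www.github.com"):
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) < 2:
            raise ValueError(f"Not a repository URL: {text!r}")
        owner, repo = parts[0], parts[1]
        if repo.endswith(".git"):
            repo = repo[:-4]
        return "github", f"{owner}/{repo}"
    # Anything else is treated as a path on the local file system.
    return "local", os.path.abspath(os.path.expanduser(text))


def github_token(env=os.environ):
    """Read the token from GITHUB_TOKEN; None means no token is set."""
    return env.get("GITHUB_TOKEN") or None
```

The "owner/repo" string is the form PyGithub's `get_repo` accepts, so the result could be passed to it directly.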
Enter the GitHub repository URL or the path to a local folder:
https://github.com/your-username/your-repo
Fetching README for: your-repo
Fetching repository structure for: your-repo
Contents of :
1. .git (dir)
2. .github (dir)
3. src (dir)
4. tests (dir)
5. README.md (file)
Enter the indices of the folders/files you want to extract (e.g., 1-5,7,9-12) or 'a' for all: 3,4,5
Do you want to select sub-folders in src? (y/n/a): a
Do you want to select sub-folders in tests? (y/n/a): n
Fetching contents of selected files for: your-repo
Repository contents saved to 'your-repo_contents.txt'.
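
The index prompt in the session above accepts ranges, single numbers, and 'a' for all. A minimal sketch of parsing such input follows; `parse_selection` is a hypothetical name, and silently dropping out-of-range indices is an assumed policy rather than the script's documented behaviour.

```python
def parse_selection(selection, count):
    """Turn input such as "1-5,7,9-12" or "a" into sorted 1-based indices."""
    selection = selection.strip().lower()
    if selection == "a":
        return list(range(1, count + 1))
    chosen = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        if start > end:
            start, end = end, start
        chosen.update(range(start, end + 1))
    # Silently drop indices outside the listing (an assumed policy).
    return sorted(i for i in chosen if 1 <= i <= count)
```

With the five entries listed above, `parse_selection("3,4,5", 5)` selects src, tests, and README.md.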
- The script skips binary files and certain file types by default.
- If a file cannot be read due to unsupported encoding, it will be skipped with a corresponding message in the output file.
- This repo is a fork of https://github.com/Doriandarko/RepoToTextForLLMs, with adjustments.
Contributions are welcome! Please feel free to submit a pull request or open an issue to discuss any changes.
This project is licensed under the MIT License. See the LICENSE file for details.