Doc Master is a lightweight Python library for automated file reading and content extraction. It reads a variety of file formats into string representations, which makes it well suited to data processing, content analysis, and document management systems.
- Universal File Reading: Seamlessly handle multiple file formats, including:
  - PDF documents
  - Microsoft Word documents (.docx)
  - Excel spreadsheets
  - Text files
  - XML documents
  - Images (with base64 encoding)
  - Binary files
- Smart Format Detection: Automatic file type detection and appropriate processing
- Flexible Output: Choose between string or dictionary output formats
- Batch Processing: Process entire folders of documents efficiently
- Encoding Detection: Smart encoding detection for text files
- Enterprise-Ready: Built with stability and performance in mind
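To make the format detection, base64 image handling, and encoding detection features more concrete, here is a minimal standard-library sketch of the general idea. It is not Doc Master's implementation: `sketch_read` is a hypothetical name, and the choice of `mimetypes` plus a UTF-8 then Latin-1 fallback is an assumption made for illustration only.

```python
import base64
import mimetypes
from pathlib import Path


def sketch_read(path):
    """Illustrative only: choose a reading strategy from the guessed MIME type."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    data = path.read_bytes()

    if mime and mime.startswith("image/"):
        # Images are not text, so represent them as a base64 string.
        return base64.b64encode(data).decode("ascii")

    # Naive encoding detection: try UTF-8 (with or without a BOM) first.
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return data.decode("latin-1")
```

A real reader would also need format-specific parsers for PDF, Word, and Excel files; the sketch only covers the text and image branches.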
pip install -U doc-master
from doc_master import doc_master
# Read all files in a folder
results = doc_master(folder_path="path/to/folder", output_type="dict")
# Or read a single file
content = doc_master(file_path="path/to/file.docx")
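This README does not describe the shape of the dictionary returned with output_type="dict". As a sketch, assuming it maps each file name to its extracted text (an assumption, not something confirmed here), the results could be summarized as follows. `summarize` is a hypothetical helper, not part of Doc Master, and `example` is stand-in data.

```python
def summarize(results):
    """Return (name, character count) pairs, largest first."""
    sizes = [(name, len(text)) for name, text in results.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Stand-in for the value returned by doc_master(..., output_type="dict");
# the name-to-text mapping is assumed, not documented.
example = {"notes.txt": "hello world", "report.docx": "quarterly figures and commentary"}

for name, size in summarize(example):
    print(f"{name}: {size} characters")
```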
- Python 3.8+
- pandas
- pypdf
- python-docx
- Pillow
We love your input! We want to make contributing to Doc Master as easy and transparent as possible. Here's how you can help:
- Fork the repo
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Check out our Contributing Guidelines for more details.
If you find Doc Master useful, please consider:
- Starring the repository ⭐
- Following us on GitHub
- Joining our Discord community
- Sharing the project with others
For detailed documentation, visit our Wiki.
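# Assumption: read_single_file and AutoFileReader are importable from doc_master
from doc_master import AutoFileReader, doc_master, read_single_file

# Read a PDF file
content = read_single_file("document.pdf")

# Read an Excel file with a specific sheet
reader = AutoFileReader()
content = reader.read_file("spreadsheet.xlsx", sheet_name="Data")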
# Process a folder of documents
results = doc_master(
    folder_path="documents/",
    output_type="dict",
)
The library includes comprehensive error handling:
try:
    content = read_single_file("file.pdf")
except Exception as e:
    print(f"Error processing file: {e}")
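When reading many files, the same try/except pattern can be applied per file so that one failure does not stop the batch. The following is a minimal sketch under stated assumptions: `read_many` is a hypothetical helper, and the reader is passed in as a plain callable (here a stand-in `fake_reader`) rather than assuming anything about Doc Master's own functions.

```python
def read_many(paths, reader):
    """Apply reader to each path, collecting results and errors separately."""
    contents, errors = {}, {}
    for path in paths:
        try:
            contents[str(path)] = reader(path)
        except Exception as exc:  # broad catch, mirroring the example above
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return contents, errors


def fake_reader(path):
    """Stand-in reader for illustration; fails on a made-up '.bad' extension."""
    if str(path).endswith(".bad"):
        raise ValueError("unsupported format")
    return f"contents of {path}"


contents, errors = read_many(["a.txt", "b.bad"], fake_reader)
print(contents)
print(errors)
```

Keeping errors in their own dictionary lets the caller report or retry failed files after the successful ones have been processed.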
- Add OCR capabilities for image processing
- Support for additional file formats
- Performance optimizations for large files (multi-threading)
- Async file processing
- Command-line interface (CLI)
- Join our Discord server for discussions and support
- Check out our GitHub Issues for bug reports and feature requests
- Follow our GitHub Discussions for general questions
This project is licensed under the MIT License - see the LICENSE file for details.
- All our amazing contributors
- The open-source community
- The Swarm Corporation team
Made with ❤️ by The Swarm Corporation