A powerful, lightweight Python library for automated file reading and content extraction. Doc Master simplifies the process of reading various file formats into string representations, making it perfect for data processing, content analysis, and document management systems.

The-Swarm-Corporation/doc-master

Doc Master 📚

🚀 Features

  • Universal File Reading: Seamlessly handle multiple file formats including:

    • PDF documents
    • Microsoft Word documents (.docx)
    • Excel spreadsheets
    • Text files
    • XML documents
    • Images (with base64 encoding)
    • Binary files
  • Smart Format Detection: Automatic file type detection and appropriate processing

  • Flexible Output: Choose between string or dictionary output formats

  • Batch Processing: Process entire folders of documents efficiently

  • Encoding Detection: Smart encoding detection for text files

  • Enterprise-Ready: Built with stability and performance in mind
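
To make the format detection, base64 and encoding features above concrete, here is a minimal standard-library sketch of the general idea. It is not Doc Master's implementation: the helper name `read_as_string` is hypothetical, detection is by file extension only, and the encoding fallback (UTF-8, then Latin-1) is an assumption chosen for simplicity.

```python
import base64
import mimetypes
from pathlib import Path


def read_as_string(path):
    """Hypothetical helper (not part of Doc Master): return a file's content as a string."""
    path = Path(path)
    # Guess the type from the file extension only; no content sniffing.
    mime, _ = mimetypes.guess_type(path.name)
    raw = path.read_bytes()

    if mime and mime.startswith("image/"):
        # Images have no natural text form, so encode the raw bytes as base64.
        return base64.b64encode(raw).decode("ascii")

    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this fallback never fails.
        return raw.decode("latin-1")
```

A real reader would add dedicated branches for PDF, Word and Excel files; this sketch covers only the text and image cases.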

📦 Installation

pip install -U doc-master

🔧 Quick Start

from doc_master import doc_master

# Read all files in a folder
results = doc_master(folder_path="path/to/folder", output_type="dict")

# Or read a single file
content = doc_master(file_path="path/to/file.docx")

📋 Requirements

  • Python 3.8+
  • pandas
  • pypdf
  • python-docx
  • Pillow

🤝 Contributing

We love your input! We want to make contributing to Doc Master as easy and transparent as possible. Here's how you can help:

  1. Fork the repo
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Check out our Contributing Guidelines for more details.

🌟 Support the Project

If you find Doc Master useful, please consider:

  • Starring the repository ⭐
  • Following us on GitHub
  • Joining our Discord community
  • Sharing the project with others

📖 Documentation

For detailed documentation, visit our Wiki.

Basic Usage Examples

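# Assumption: both names are importable from the top-level package;
# adjust the import if your installed version exposes them elsewhere.
from doc_master import AutoFileReader, read_single_file

# Read a PDF file
content = read_single_file("document.pdf")

# Read a specific sheet from an Excel file
reader = AutoFileReader()
content = reader.read_file("spreadsheet.xlsx", sheet_name="Data")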

# Process a folder of documents
results = doc_master(
    folder_path="documents/",
    output_type="dict"
)

🔍 Error Handling

Wrap read calls in a try/except block to handle files that cannot be processed:

try:
    content = read_single_file("file.pdf")
except Exception as e:
    print(f"Error processing file: {e}")
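
When reading many files, it can help to record failures per file instead of stopping at the first one. The sketch below shows one way to do that. `read_folder_safely` and `read_utf8_text` are hypothetical names, not part of Doc Master, and the reader is passed in as a parameter so the example runs without the library installed.

```python
from pathlib import Path


def read_folder_safely(folder_path, reader):
    """Hypothetical helper: apply `reader` to each file, collecting failures instead of stopping."""
    contents, errors = {}, {}
    for path in sorted(Path(folder_path).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        try:
            contents[path.name] = reader(path)
        except Exception as exc:
            # Broad on purpose: the README does not list specific exception types.
            errors[path.name] = f"{type(exc).__name__}: {exc}"
    return contents, errors


def read_utf8_text(path):
    """Stand-in reader so the sketch is runnable on its own."""
    return Path(path).read_text(encoding="utf-8")
```

In practice you would pass `read_single_file` as the `reader` argument, assuming it accepts a path and raises an exception on failure.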

🛣️ Roadmap

  • Add OCR capabilities for image processing
  • Support for additional file formats
  • Performance optimizations for large files (multi-threading)
  • Async file processing
  • Command-line interface (CLI)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • All our amazing contributors
  • The open-source community
  • The Swarm Corporation team

Made with ❤️ by The Swarm Corporation

⭐ Star us on GitHub!
