Scrape and parse code and library documentation, providing structured data for large language models (LLMs) and coding agents.

vialcollet/llm-doc-scraper

LLM Doc Scraper

A documentation scraper that uses large language models (LLMs) to process and summarize web documentation. Built with Streamlit, it provides a simple interface for scraping documentation from websites and generating AI-enhanced summaries.

Features

  • 🌐 Web scraping with intelligent link detection
  • 🤖 LLM-powered content summarization
  • 📝 Markdown output format
  • 💾 SQLite-based URL history
  • 📁 Automatic file organization
  • 🎯 Smart page title detection
  • 🖥️ User-friendly Streamlit interface
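
As a rough illustration of what link detection on a documentation page can involve, the sketch below collects the links on a page, resolves them against the page URL, and keeps only those on the same host. It uses only the Python standard library; the names `LinkCollector` and `find_doc_links` are hypothetical and not taken from this project, whose actual detection logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_doc_links(html, base_url):
    """Return same-host absolute links in page order, without duplicates."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" anchors.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == base_host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Filtering by host is only one possible heuristic; a real scraper might also restrict links to a path prefix or skip non-HTML resources.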

Prerequisites

  • Python 3.8+
  • just command runner
  • OpenAI API key

Setup

  1. Clone the repository:

    git clone git@github.com:vialcollet/llm-doc-scraper.git
    cd llm-doc-scraper
  2. Create a .env file with your OpenAI API key:

    echo "API_KEY=your-openai-api-key" > .env
  3. Run the complete setup using just:

    just setup

    This will:

    • Create a virtual environment
    • Install required packages
    • Start the Streamlit application
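
For readers curious how a `.env` file like the one in step 2 can be consumed, here is a minimal standard-library sketch that reads `API_KEY` from it. This is an assumption about the general approach, not the project's code: the application may well use a library such as python-dotenv instead, and the function names `read_env_file` and `get_api_key` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("API_KEY") or read_env_file(path).get("API_KEY")
    if not key:
        raise RuntimeError("API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than a later authentication failure from the API.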

Usage

  1. Start the application:

    just run
  2. Enter a documentation URL in the interface

  3. Select the pages you want to scrape

  4. Click "Scrape and Generate Document"

  5. Find your processed documentation in the docs directory

Available Commands

  • just run - Start the Streamlit application
  • just install - Install required packages
  • just setup - Complete setup (venv, install, run)
  • just init-db - Initialize/reset the database schema
  • just clean-db - Remove the URL history database
  • just reset - Reset everything (database, docs, venv)
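
The database schema itself is not documented here. As a sketch of what a SQLite-backed URL history with an initialize/reset step might look like, the example below uses Python's built-in `sqlite3` module. The table name `url_history`, its columns, and the function names are all hypothetical; the schema that `just init-db` creates may be different.

```python
import sqlite3


def init_db(conn):
    """Drop and recreate the (hypothetical) url_history table."""
    conn.execute("DROP TABLE IF EXISTS url_history")
    conn.execute(
        """
        CREATE TABLE url_history (
            url TEXT PRIMARY KEY,
            last_used TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()


def remember_url(conn, url):
    """Store the URL; re-adding an existing URL refreshes its row."""
    conn.execute("INSERT OR REPLACE INTO url_history (url) VALUES (?)", (url,))
    conn.commit()


def recent_urls(conn, limit=10):
    """Return the most recently used URLs, newest first."""
    rows = conn.execute(
        "SELECT url FROM url_history ORDER BY last_used DESC, rowid DESC LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```

A usage example with an in-memory database:

```python
conn = sqlite3.connect(":memory:")
init_db(conn)
remember_url(conn, "https://docs.example.com/")
print(recent_urls(conn))
```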

Output

The scraped documentation is:

  • Saved in Markdown format
  • Organized by page title
  • Stored in the docs directory
  • Cleaned of navigation elements and non-documentation content
  • Enhanced with AI-powered summarization
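
To make "organized by page title" and "stored in the docs directory" more concrete, the sketch below turns a page title into a safe filename and writes a Markdown file. The exact naming scheme the project uses is not stated, so the slug rules and the names `title_to_filename` and `save_page` are assumptions made for illustration.

```python
import re
from pathlib import Path


def title_to_filename(title):
    """Lowercase the title and replace runs of other characters with '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return (slug or "untitled") + ".md"


def save_page(title, body, docs_dir="docs"):
    """Write the page as Markdown under docs_dir and return its path."""
    out_dir = Path(docs_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / title_to_filename(title)
    path.write_text(f"# {title}\n\n{body.strip()}\n", encoding="utf-8")
    return path
```

Under these assumptions, a page titled "Getting Started: Install & Run" would be written to `docs/getting-started-install-run.md`.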

Contributing

Feel free to open issues or submit pull requests if you have suggestions for improvements.

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

This means you can freely use, modify, and distribute this software, but any modified versions you distribute must also be released under the GPL-3.0 license. For more information, see the GNU GPL v3.
