# LLM Doc Scraper

A powerful documentation scraper that uses Large Language Models (LLMs) to intelligently process and summarize web documentation. Built with Streamlit for an easy-to-use interface, it lets you scrape documentation from websites and get AI-enhanced summaries.

## Features
- 🌐 Web scraping with intelligent link detection
- 🤖 LLM-powered content summarization
- 📝 Markdown output format
- 💾 SQLite-based URL history
- 📁 Automatic file organization
- 🎯 Smart page title detection
- 🖥️ User-friendly Streamlit interface
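The scraper's internals aren't shown in this README, but the link-detection feature can be illustrated with a standard-library sketch. `collect_links` and the same-site filter below are assumptions for illustration, not the project's actual code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute href targets from anchor tags in an HTML page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def collect_links(html: str, base_url: str) -> list[str]:
    """Return links found in `html`, keeping only same-site targets."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [link for link in parser.links if link.startswith(base_url)]
```

A real scraper would then fetch each selected link, strip navigation chrome, and pass the remaining text to the LLM for summarization.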
## Prerequisites

- Python 3.8+
- `just` command runner
- OpenAI API key
## Installation

1. Clone the repository:

   ```bash
   git clone git@github.com:vialcollet/llm-doc-scraper.git
   cd llm-doc-scraper
   ```

2. Create a `.env` file with your OpenAI API key:

   ```bash
   echo "API_KEY=your-openai-api-key" > .env
   ```

3. Run the complete setup using `just`:

   ```bash
   just setup
   ```

   This will:
   - Create a virtual environment
   - Install required packages
   - Start the Streamlit application
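Step 2 writes an `API_KEY=...` line into `.env`. How the app loads it isn't documented here (a library such as python-dotenv is common); the sketch below is a minimal standard-library equivalent, with `load_env` as a hypothetical helper:

```python
import os


def load_env(path: str = ".env") -> None:
    """Minimal .env loader: export KEY=value pairs into os.environ."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and anything without '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```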
## Usage

1. Start the application:

   ```bash
   just run
   ```

2. Enter a documentation URL in the interface
3. Select the pages you want to scrape
4. Click "Scrape and Generate Document"
5. Find your processed documentation in the `docs` directory
## Available Commands

- `just run` - Start the Streamlit application
- `just install` - Install required packages
- `just setup` - Complete setup (venv, install, run)
- `just init-db` - Initialize/reset the database schema
- `just clean-db` - Remove the URL history database
- `just reset` - Reset everything (database, docs, venv)
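The `init-db` and `clean-db` recipes manage the SQLite URL history. The project's actual schema isn't shown in this README, so the sketch below is only a guess at what such a table might look like; `url_history`, `init_db`, and `remember` are illustrative names:

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the URL-history table if it does not exist (illustrative schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS url_history (
               url TEXT PRIMARY KEY,
               scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.commit()


def remember(conn: sqlite3.Connection, url: str) -> None:
    """Record a scraped URL, ignoring duplicates."""
    conn.execute("INSERT OR IGNORE INTO url_history (url) VALUES (?)", (url,))
    conn.commit()
```

With this shape, `just clean-db` would simply delete the database file, and `just init-db` would recreate the table from scratch.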
## Output

The scraped documentation is:

- Saved in Markdown format
- Organized by page title
- Stored in the `docs` directory
- Cleaned of navigation elements and non-documentation content
- Enhanced with AI-powered summarization
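The README doesn't specify how page titles become filenames; a common approach is slugifying them, sketched below. `title_to_filename` and its slug rules are assumptions, not the project's actual logic:

```python
import re
from pathlib import Path


def title_to_filename(title: str, docs_dir: str = "docs") -> Path:
    """Turn a page title into a filesystem-safe Markdown path under docs/."""
    # Lowercase, collapse runs of non-alphanumerics into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return Path(docs_dir) / f"{slug or 'untitled'}.md"
```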
## Contributing

Feel free to open issues or submit pull requests if you have suggestions for improvements.
## License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

This means you can freely use, modify, and distribute this software, but any modifications must also be released under the GPL-3.0 license. For more information, visit GNU GPL v3.