LLM Doc Scraper

A documentation scraper that uses a large language model (LLM) to process and summarize web documentation. Built with Streamlit, it lets you scrape documentation from a website, choose the pages you want, and get AI-generated summaries in Markdown.

Features

🌐 Web scraping with intelligent link detection
🤖 LLM-powered content summarization
📝 Markdown output format
💾 SQLite-based URL history
📁 Automatic file organization
🎯 Smart page title detection
🖥️ User-friendly Streamlit interface
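
As a rough illustration of the link detection listed above, the sketch below collects same-site links from a page using only the Python standard library. The names `LinkCollector` and `find_doc_links` are hypothetical, and the same-host filter is an assumption; the project's actual detection logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_doc_links(html, base_url):
    """Return http(s) links on the same host as base_url, deduplicated, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so anchors do not create duplicates.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != base_host:
            continue  # assumption: only links on the same host count as documentation
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Calling `find_doc_links(page_html, "https://example.com/docs/index.html")` would return absolute URLs that a page picker could then display.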

Prerequisites

Python 3.8+
just command runner
OpenAI API key

Setup

Clone the repository:

git clone git@github.com:vialcollet/llm-doc-scraper.git
cd llm-doc-scraper

Create a .env file with your OpenAI API key:

echo "API_KEY=your-openai-api-key" > .env
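
For reference, here is a minimal sketch of how a Python program could read the `API_KEY` entry written above. It is a stdlib-only illustration; the helper names `load_env_file` and `get_api_key` are hypothetical, and the application itself may load the file differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("API_KEY") or load_env_file(path).get("API_KEY")
    if not key:
        raise RuntimeError("API_KEY is not set; add it to .env or the environment")
    return key
```

Keep the .env file out of version control so the key is not committed by accident.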

Run the complete setup using just:
```
just setup
```
This will:
- Create a virtual environment
- Install required packages
- Start the Streamlit application

Usage

Start the application:
```
just run
```
Then:
- Enter a documentation URL in the interface.
- Select the pages you want to scrape.
- Click "Scrape and Generate Document".
- Find your processed documentation in the docs directory.

Available Commands

- just run: Start the Streamlit application
- just install: Install required packages
- just setup: Complete setup (venv, install, run)
- just init-db: Initialize or reset the database schema
- just clean-db: Remove the URL history database
- just reset: Reset everything (database, docs, venv)
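
To give a feel for what the database commands above manage, the sketch below shows one way a SQLite URL history could be initialized and queried from Python. The table name `url_history`, its columns, and the function names are assumptions made for illustration; the project's real schema is not documented here.

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) url_history table, dropping any old copy first."""
    conn.execute("DROP TABLE IF EXISTS url_history")
    conn.execute(
        """
        CREATE TABLE url_history (
            url TEXT PRIMARY KEY,
            last_used TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()


def remember_url(conn, url):
    """Store a URL; an already-stored URL is replaced so its timestamp is refreshed."""
    conn.execute(
        "INSERT OR REPLACE INTO url_history (url, last_used) "
        "VALUES (?, CURRENT_TIMESTAMP)",
        (url,),
    )
    conn.commit()


def recent_urls(conn, limit=10):
    """Return stored URLs, most recently used first."""
    rows = conn.execute(
        "SELECT url FROM url_history ORDER BY last_used DESC, rowid DESC LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```

Passing `sqlite3.connect(":memory:")` gives a throwaway database for experiments; a file path would persist the history between runs.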

Output

The scraped documentation is:

- Saved in Markdown format
- Organized by page title
- Stored in the docs directory
- Cleaned of navigation elements and non-documentation content
- Enhanced with AI-powered summarization
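
The "organized by page title" behavior could look roughly like the sketch below, which turns a title into a filename and writes the page into the docs directory. The function names and the exact naming scheme (lowercase, hyphen-separated) are assumptions for illustration, not necessarily what the application does.

```python
import re
from pathlib import Path


def title_to_filename(title):
    """Turn a page title into a lowercase, hyphen-separated .md filename."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return (slug or "untitled") + ".md"


def save_page(title, markdown_body, docs_dir="docs"):
    """Write one page as Markdown under docs_dir and return the file path."""
    out_dir = Path(docs_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / title_to_filename(title)
    # Put the title back as a top-level heading so each file is self-describing.
    path.write_text(f"# {title}\n\n{markdown_body.strip()}\n", encoding="utf-8")
    return path
```

With this scheme, a page titled "Getting Started" would be written to docs/getting-started.md.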

Contributing

Feel free to open issues or submit pull requests if you have suggestions for improvements.

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

This means you can freely use, modify, and distribute this software, but any modified versions you distribute must also be released under the GPL-3.0 license. For more information, see the GNU GPL v3 license text.
