This project is an AI-based web-scraping agent that extracts data from websites. It pairs a Python backend with a Node.js-based frontend and relies on the Gemini and SerpApi APIs to handle web-scraping tasks at scale.
Features:
- Automated Data Extraction: Automatically extracts data from specified websites.
- AI Integration: Uses AI algorithms to enhance data extraction accuracy.
- Scalability: Designed to handle large-scale web scraping tasks.
- Customizable: Easily configurable to target different websites and data points.
- Environment Management: Utilizes environment variables for secure API key management.
- CSV Data Handling: Supports scraping data from CSV files.
- Enhanced Error Handling: Improved error handling and logging mechanisms.
Use Cases:
- Market Research: Gather data from competitors' websites for market analysis.
- Price Monitoring: Track prices of products across various e-commerce platforms.
- Content Aggregation: Aggregate content from multiple sources for news or blog websites.
- Data Mining: Extract large datasets for machine learning and data analysis.
Prerequisites:
- Node.js (v14 or higher)
- Python (v3.8 or higher)
- pip (Python package installer)
- A Python virtual environment tool (the setup steps use the built-in `venv` module, so a separate `virtualenv` install is optional)
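Before starting the setup, it can help to confirm that the listed tools are actually installed. The following is a minimal sketch, not part of the repository: the function names `parse_major` and `check_prerequisites` are hypothetical, and the version thresholds are simply copied from the list above.

```python
import re
import shutil
import subprocess
import sys


def parse_major(version_text):
    """Return the first integer found in text such as 'v18.17.0', or None."""
    match = re.search(r"(\d+)", version_text)
    return int(match.group(1)) if match else None


def check_prerequisites(min_python=(3, 8), min_node_major=14):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Python: compare the running interpreter against the minimum version.
    if sys.version_info[:2] < min_python:
        found = sys.version.split()[0]
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")

    # Node.js: locate the executable on PATH, then ask it for its version.
    node = shutil.which("node")
    if node is None:
        problems.append("node not found on PATH")
    else:
        try:
            result = subprocess.run([node, "--version"], capture_output=True, text=True)
            major = parse_major(result.stdout)
        except OSError:
            major = None
        if major is None or major < min_node_major:
            problems.append(f"Node.js v{min_node_major}+ required")

    # pip: either 'pip' or 'pip3' on PATH is accepted.
    if shutil.which("pip") is None and shutil.which("pip3") is None:
        problems.append("pip not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("All prerequisites found." if not issues else "\n".join(issues))
```

Running the script prints either a confirmation or one line per missing or outdated tool.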
Backend Setup:
- Clone the Repository: run `git clone https://github.com/justAbhinav/AI-based-web-scraping-agent`, then `cd AI-based-web-scraping-agent/backend`.
- Create and Activate Virtual Environment: run `python -m venv venv`, then `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`).
- Install Dependencies: run `pip install -r requirements.txt`.
- Set Up Environment Variables: create a `.env` file in the top-level `backend` directory and add your API keys and other environment variables: `GEMINI_API_KEY=` (your Gemini API key) and `SERPAPI_API_KEY=` (your SerpApi API key).
- Run the Backend Server: run `python app.py`.
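A missing or empty API key is an easy mistake to make at this stage. The sketch below shows one way to read a `.env` file and report which of the two backend keys are still unset. It uses only the standard library and is an illustration, not the project's actual loading code (the backend may well use a library such as `python-dotenv`); the helper names `parse_env_text`, `missing_keys`, and `load_env_file` are hypothetical.

```python
import os
from pathlib import Path

# Key names taken from the backend setup step.
REQUIRED_KEYS = ("GEMINI_API_KEY", "SERPAPI_API_KEY")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment (" # ...") and surrounding quotes.
        # Simplification: a value that itself contains " #" would be cut short.
        value = value.split(" #", 1)[0].strip().strip("\"'")
        values[key.strip()] = value
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def load_env_file(path=".env"):
    """Read the file if present and copy its values into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        # Variables already set in the shell take precedence.
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    missing = missing_keys(load_env_file(".env"))
    print("All keys set." if not missing else f"Missing keys: {', '.join(missing)}")
```

Run from the `backend` directory, this prints the names of any keys that still need a value.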
Frontend Setup:
- Navigate to Frontend Directory: run `cd ../frontend`.
- Install Dependencies: run `npm install`.
- Set Up Environment Variables: create a `.env` file in the `frontend` directory and add your environment variables: `REACT_APP_GOOGLE_CLIENT_ID=your-google-client-id` and `REACT_APP_GOOGLE_API_KEY=your-google-api-key`.
- Run the Frontend Server: run `npm start`.
You may use the provided `testing.csv` and a prompt like "What is the annual income of these companies?" to test the application. The application will scrape the data from the provided CSV file and display the results on the frontend.
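To make the CSV-plus-prompt workflow more concrete, here is a rough sketch of how a prompt could be combined with each row of a CSV file to produce one query per entity. This is an assumption about the general approach, not the application's confirmed implementation: the function `build_queries`, the `company` column name, and the sample company names are all hypothetical, and the real layout of `testing.csv` may differ.

```python
import csv
import io


def build_queries(csv_text, prompt, column=None):
    """Pair the prompt with each non-empty value in one CSV column.

    If no column is named, the first column in the header row is used.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if not reader.fieldnames:
        return []
    column = column or reader.fieldnames[0]

    queries = []
    for row in reader:
        entity = (row.get(column) or "").strip()
        if entity:
            queries.append(f"{prompt} ({entity})")
    return queries


if __name__ == "__main__":
    # Hypothetical sample data standing in for the contents of testing.csv.
    sample = "company\nAcme Corp\nGlobex\n"
    for query in build_queries(sample, "What is the annual income of these companies?"):
        print(query)
```

Each resulting query could then be passed to a search or language-model call, with the answers collected per row for display.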
- Documentation: Detailed documentation is available in the `docs` directory.
- Contributing: Contributions are welcome!
- License: This project is licensed under the MIT License. See the `LICENSE` file for details.
For common issues and troubleshooting steps, please refer to the Troubleshooting Guide.
For any questions or inquiries, please contact [email protected].