This project demonstrates how to scrape articles from the Times of India (TOI) website using Python. It extracts news content such as headlines, article bodies, publication dates, and authors, and stores the data in a structured format for further analysis. The project also uses Hugging Face Transformers to summarize articles and spaCy to filter the extracted text.
This notebook aims to scrape and analyze articles from TOI. It includes steps to:
- Connect to the TOI website.
- Extract headlines, article texts, and metadata.
- Use spaCy for data filtration and processing of the extracted text.
- Utilize Hugging Face Transformers to summarize articles.
- Store the extracted data in a structured format for later use in data analysis or sentiment analysis.
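The extraction step might look roughly like the sketch below. It parses an inline HTML sample instead of a live page, so it runs without network access. The markup, the class names (`headline`, `author`, `article-body`), and the helper `parse_article` are all assumptions made for illustration; the real TOI pages use different markup, and the notebook's own selectors may differ.

```python
import re

from bs4 import BeautifulSoup

# Invented sample markup; real TOI pages use different tags and classes.
SAMPLE_HTML = """
<html><body>
  <h1 class="headline">Sample headline</h1>
  <span class="author">A. Reporter</span>
  <div class="article-body">
    <p>First paragraph.</p>
    <p>Second   paragraph.</p>
  </div>
</body></html>
"""


def _text(node):
    """Return whitespace-normalised text for a tag, or None if it is missing."""
    if node is None:
        return None
    return re.sub(r"\s+", " ", node.get_text(" ", strip=True))


def parse_article(html):
    """Hypothetical helper: turn one article page into a flat record."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "headline": _text(soup.find("h1", class_="headline")),
        "author": _text(soup.find("span", class_="author")),
        "body": _text(soup.find("div", class_="article-body")),
    }


record = parse_article(SAMPLE_HTML)
print(record)
```

A list of such records can be passed to `pandas.DataFrame` to get the structured table mentioned above. Missing elements come back as `None` rather than raising, which keeps a long scraping run from stopping on one malformed page.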
To run the code, clone this repository and install the required dependencies:
git clone <repository-url>
cd <repository-directory>
pip install -r requirements.txt
- Open the Jupyter notebook file Scrapping_TOI.ipynb.
- Run the cells in sequence to scrape articles from TOI.
- Modify the base URL, if needed, to target different sections of TOI.
- The extracted data will be filtered using spaCy and summarized using Transformers.
- The final data will be displayed or saved based on the configuration in the notebook.
# Example of launching the notebook
jupyter notebook Scrapping_TOI.ipynb
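Changing the base URL to target other sections could be handled as in the minimal sketch below. The section names and the helpers `section_urls` and `polite_iter` are hypothetical and not taken from the notebook; check the live site for the actual section paths. The delay between requests uses `time`, which is already among the project's listed libraries.

```python
import time
from urllib.parse import urljoin

BASE_URL = "https://timesofindia.indiatimes.com/"

# Hypothetical section paths, used only for illustration.
SECTIONS = ["india", "world", "business"]


def section_urls(base_url, sections):
    """Build one absolute URL per section path."""
    return [urljoin(base_url, section) for section in sections]


def polite_iter(urls, delay_seconds=1.0):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url


for url in polite_iter(section_urls(BASE_URL, SECTIONS), delay_seconds=0):
    print(url)
```

In a real run, a non-zero `delay_seconds` spaces out requests so the scraper does not hit the site in a tight loop.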
- Python 3.x
- Jupyter Notebook
- Libraries: requests, BeautifulSoup, pandas, time, re, transformers, spacy
You can install all the necessary libraries using the following command:
pip install -r requirements.txt
├── Scrapping_TOI.ipynb # Main notebook for scraping TOI articles
├── requirements.txt # Required Python libraries
└── README.md # Project documentation
Feel free to submit a pull request or create an issue if you have suggestions for improving the project.
This project is licensed under the MIT License - see the LICENSE file for details.