The Guardian News Article Collector
Collects web articles from The Guardian using The Guardian Open Platform API and exports them to a CSV file.
Author: Faris Durrani
GitHub: https://github.com/farisdurrani/TheGuardianArticlesCollector
Prerequisites:
- Use Python 3.10
- Install the requirements listed in requirements.txt
Running:
- Get an API key from The Guardian Open Platform API and put it in a new .env file in the root directory, as follows:
  API_KEY="00c0eb00-c0fe-4c1e-a312-000000"
- Run python main.py
- See the results in the new CSV files written to the outputs/ directory
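As a rough illustration of the setup above, the sketch below reads API_KEY from the .env file and builds a request URL for the Open Platform search endpoint. It is not taken from main.py: the helper names read_api_key and build_search_url are hypothetical, and the choice of query parameters (page size, the bodyText field) is an assumption about what a collector like this might request.

```python
from pathlib import Path
from urllib.parse import urlencode

# Search endpoint of The Guardian Open Platform API.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def read_api_key(env_path: str = ".env") -> str:
    """Return the API_KEY value from a simple KEY="value" .env file (hypothetical helper)."""
    for line in Path(env_path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "API_KEY":
            # Drop optional surrounding quotes.
            return value.strip().strip('"').strip("'")
    raise KeyError("API_KEY not found in " + env_path)


def build_search_url(api_key: str, from_date: str, to_date: str, page: int = 1) -> str:
    """Build a search URL for one date window and one result page (hypothetical helper)."""
    params = {
        "api-key": api_key,
        "from-date": from_date,  # YYYY-MM-DD
        "to-date": to_date,      # YYYY-MM-DD
        "page": page,
        "page-size": 50,              # assumed page size
        "show-fields": "bodyText",    # assumed: request the article body text
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

The resulting URL could then be fetched with any HTTP client; the project itself may well use a different client or parameter set.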
Below is one explicit option in main.py to modify your search preferences:
- START_DATES: Since the API limits the number of articles that can be collected in each call, this list gives the start and end dates of each API call (at most 3 months apart is recommended). The first call gets articles dated from START_DATES[0] to START_DATES[1] inclusive, the second call from START_DATES[1] to START_DATES[2] inclusive, and so on.
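To make the pairing of dates concrete, here is a minimal sketch of how consecutive entries of START_DATES turn into per-call date windows. The date values and the date_windows helper are hypothetical and only illustrate the rule described above; main.py may implement it differently.

```python
# Hypothetical values, spaced three months apart.
START_DATES = ["2022-01-01", "2022-04-01", "2022-07-01"]


def date_windows(start_dates):
    """Pair each date with the next one: (from_date, to_date) for each API call."""
    return list(zip(start_dates, start_dates[1:]))


for from_date, to_date in date_windows(START_DATES):
    print(f"call covers {from_date} to {to_date} inclusive")
```

A list of N dates yields N - 1 calls. Because both ends of each window are inclusive, the boundary date is shared by two consecutive windows, so articles published on that date may need deduplicating.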
See a sample output in sample_output.csv
The Python script make_sentiment.py uses the VaderSentiment library to compute the sentiment of a CSV of strings (e.g., the body text of The Guardian articles) and append the 4-column results to a copy of the CSV. In the script, simply:
- Add the names of the source CSVs to analyze;
- Set ORIGINAL_SAMPLES_DIR to the directory of the source CSV files;
- Set SENTIMENTS_DIR to the directory for the new CSV files; and
- Set TARGET_COLUMN to the column containing the target strings.
Note: This script uses multiprocessing, which makes it very fast but can be computationally heavy.
See sample_output-sentiments.csv for a sample output.
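The idea behind make_sentiment.py can be sketched as follows. This is a simplified, sequential illustration (it leaves out the multiprocessing the script uses), not the script's actual code. The function names, the output column names (taken from the keys VADER's polarity_scores returns) and the bodyText column in the usage note are assumptions.

```python
import csv

# The four scores returned by VADER's polarity_scores().
SCORE_KEYS = ("neg", "neu", "pos", "compound")


def vader_scorer():
    """Return VADER's scoring function (requires the vaderSentiment package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores


def append_sentiments(src_path, dst_path, target_column, scorer):
    """Copy src_path to dst_path, appending four sentiment columns per row."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + list(SCORE_KEYS)
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Score the text in the target column; treat empty cells as "".
            scores = scorer(row[target_column] or "")
            row.update({key: scores[key] for key in SCORE_KEYS})
            writer.writerow(row)
```

With vaderSentiment installed, a call might look like append_sentiments("sample_output.csv", "sample_output-sentiments.csv", "bodyText", vader_scorer()), where bodyText stands in for whatever TARGET_COLUMN is set to.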
The Guardian News Article Collector is MIT licensed, as found in the LICENSE file.
The Guardian News Article Collector documentation is Creative Commons licensed, as found in the LICENSE-docs file.