The Guardian News Article Collector

Collects web articles from The Guardian using The Guardian Open Platform API and exports them to a CSV file.

Author: Faris Durrani
GitHub: https://github.com/farisdurrani/TheGuardianArticlesCollector

How to Use

Prerequisites:

Use Python 3.10
Install the requirements listed in requirements.txt (e.g., pip install -r requirements.txt)

Running:

Get an API key from The Guardian Open Platform API and put it in a new .env file in the root directory as follows:
```
API_KEY="00c0eb00-c0fe-4c1e-a312-000000"
```
Run python main.py
See the results in new CSV files written to the outputs/ directory
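
As a rough illustration of what the steps above involve, the sketch below parses API_KEY out of a .env file and builds a search request URL for The Guardian's content endpoint. It is not taken from main.py: the helper names (parse_env, read_env, build_search_url) are hypothetical, and the choice of query parameters (from-date, to-date, page, page-size, show-fields) is an assumption about which of the API's options the script relies on.

```python
from pathlib import Path
from urllib.parse import urlencode

# Search endpoint of The Guardian Open Platform content API.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def parse_env(text):
    """Parse simple KEY="value" lines (the .env format shown above) into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def read_env(path=".env"):
    """Read and parse a .env file from disk."""
    return parse_env(Path(path).read_text(encoding="utf-8"))


def build_search_url(api_key, from_date, to_date, page=1, page_size=50):
    """Build a search URL for articles published between two ISO dates."""
    params = {
        "api-key": api_key,
        "from-date": from_date,
        "to-date": to_date,
        "page": page,
        "page-size": page_size,
        "show-fields": "bodyText",  # assumption: ask for the article body text
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Example usage (requires a .env file in the current directory):
#   key = read_env()["API_KEY"]
#   print(build_search_url(key, "2022-01-01", "2022-03-31"))
```

Fetching the URL and writing the response to outputs/ is left out here; main.py handles that part.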

Options

Below is one explicit option in main.py to modify your search preferences:

START_DATES - Since the API limits the number of articles that can be collected in each call, this lists the start and end dates of each API call (recommended: at most 3 months apart). The first call gets articles from START_DATES[0] to START_DATES[1] inclusive, the second call from START_DATES[1] to START_DATES[2] inclusive, and so on.
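
The way consecutive entries of START_DATES pair up into call windows can be sketched as follows. This is only an illustration: the date values are made up, the helper name date_windows is hypothetical, and storing the dates as datetime.date objects is an assumption (main.py may use strings or another type).

```python
from datetime import date

# Example values only; each pair of neighbours is about 3 months apart.
START_DATES = [date(2022, 1, 1), date(2022, 4, 1), date(2022, 7, 1)]


def date_windows(start_dates):
    """Pair each date with the next one, giving one (from, to) window per API call."""
    return [
        (start.isoformat(), end.isoformat())
        for start, end in zip(start_dates, start_dates[1:])
    ]


for from_date, to_date in date_windows(START_DATES):
    print(from_date, "to", to_date)
```

A list of N dates yields N - 1 calls. Because both ends of each window are inclusive, a boundary date such as START_DATES[1] falls inside two consecutive windows.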

Sample Output

See a sample output in sample_output.csv

Bonus: Sentiment Analysis

The Python script make_sentiment.py uses the vaderSentiment library to compute the sentiment of a CSV of strings (e.g., the body text of The Guardian articles) and appends the 4-column results to a copy of the CSV. In the script, simply:

Add the names of the source CSVs to analyze;
Modify the original directory of the source CSV files in ORIGINAL_SAMPLES_DIR;
Modify the directory of the new CSV files in SENTIMENTS_DIR; and
Modify the TARGET_COLUMN containing the target strings.

Note: This script uses multiprocessing, which makes it very fast but can be computationally heavy.

See sample_output-sentiments.csv for a sample output.
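
For readers who want a feel for the approach, here is a minimal sketch of scoring one CSV column and appending four sentiment columns. It is not the code of make_sentiment.py: the function names (vader_scores, add_sentiment, score_csv) are hypothetical, and it assumes the four appended columns are the neg, neu, pos and compound values returned by vaderSentiment's polarity_scores.

```python
import csv
from multiprocessing import Pool

# Assumption: the four appended columns are VADER's polarity score keys.
SCORE_COLUMNS = ["neg", "neu", "pos", "compound"]

_analyzer = None  # created lazily, once per process


def vader_scores(text):
    """Score one string with VADER (requires `pip install vaderSentiment`)."""
    global _analyzer
    if _analyzer is None:
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        _analyzer = SentimentIntensityAnalyzer()
    return _analyzer.polarity_scores(text)


def add_sentiment(rows, target_column, score_fn=vader_scores, processes=None):
    """Return copies of the row dicts with the four score columns appended."""
    texts = [row[target_column] for row in rows]
    if processes == 1:
        # Single-process path: no worker pool is started.
        scores = [score_fn(text) for text in texts]
    else:
        # Spread the scoring across worker processes.
        with Pool(processes) as pool:
            scores = pool.map(score_fn, texts)
    return [
        {**row, **{col: score[col] for col in SCORE_COLUMNS}}
        for row, score in zip(rows, scores)
    ]


def score_csv(src_path, dst_path, target_column, processes=None):
    """Read src_path, score target_column, and write the result to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    scored = add_sentiment(rows, target_column, processes=processes)
    if not scored:
        return  # empty input: nothing is written
    with open(dst_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(scored[0].keys()))
        writer.writeheader()
        writer.writerows(scored)
```

When using more than one process, call score_csv from under an `if __name__ == "__main__":` guard, since multiprocessing may re-import the module in each worker on some platforms. Passing processes=1 avoids the worker pool entirely, which can be handy for small files or debugging.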

License

The Guardian News Article Collector is MIT licensed, as found in the LICENSE file.

The Guardian News Article Collector documentation is Creative Commons licensed, as found in the LICENSE-docs file.

