A content aggregator that collects metadata about articles from newspapers, journals, blogs, etc. The scraper uses information about the structure of the targeted pages, together with regular expressions, to scrape selectively and filter the results. To make it easier to start a new project, the data required by the scraper has been exported to files in the fixtures directory.
Inspired by AllTop.
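For illustration, Django fixtures are JSON lists of objects keyed by model, primary key, and fields. Only the model label (articles.source) is taken from the commands further below; the field names and values in this sketch are made up, and fixtures/sources.json is authoritative:

[
  {
    "model": "articles.source",
    "pk": 1,
    "fields": {
      "name": "Example News",
      "url": "https://news.example.com",
      "regex": "articles/[0-9]+"
    }
  }
]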
Make sure Docker is installed on your system.
Clone the repository into a directory of your choice:
mkdir MYAPPDIR
git clone https://github.com/pi-sigma/nous-aggregator.git MYAPPDIR
Inside the new directory, create a file for the environment variables:
touch .env
Open the file with the editor of your choice and set the environment variables. See env-sample for instructions.
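For orientation only, a .env for a Django/Postgres setup often contains entries along these lines; the variable names here are illustrative, and env-sample is authoritative:

# Illustrative values only; env-sample lists the variables this project actually expects
DEBUG=True
SECRET_KEY=replace-with-a-long-random-string
POSTGRES_DB=nous_aggregator
POSTGRES_USER=nous
POSTGRES_PASSWORD=replace-me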
Build the Docker image:
docker-compose build
Start the web container in detached mode, apply the migrations, and initialize the database:
docker-compose up -d web
docker-compose run web python manage.py migrate
docker-compose run web python manage.py loaddata fixtures/sources.json
Create a superuser for the Django app:
docker-compose run web python manage.py createsuperuser
Stop the containers:
docker-compose stop web
docker-compose stop db
Start the Docker containers:
docker-compose up
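To verify that the services are up, list the running containers:

docker-compose ps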
You can access the app at one of the following addresses:
http://0.0.0.0:8000
http://127.0.0.1:8000
http://localhost:8000
If all went well, you should see the homepage of the app with a list of news sources arranged in a grid. The grid is empty to begin with and fills up once the Celery workers start scraping (the timing depends on the schedule in scraper.tasks).
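For context, periodic Celery tasks in a Django project are commonly registered via a beat schedule. The sketch below is for illustration only: the task name, path, and interval are assumptions, and the real schedule is defined in scraper.tasks:

# Sketch of a Celery beat schedule (hypothetical task path and interval)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "scrape-sources": {
        "task": "scraper.tasks.scrape",  # hypothetical task path
        "schedule": crontab(minute=0),   # e.g. run at the top of every hour
    },
}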
To export the data about the sources from the database, use the following command while the web container is running (the commands for the other tables are analogous):
docker-compose run web python manage.py dumpdata articles.source --indent 2 > fixtures/sources.json
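For example, assuming the articles are stored in a model labeled articles.article (hypothetical; check the app's models for the exact label), the corresponding export would be:

docker-compose run web python manage.py dumpdata articles.article --indent 2 > fixtures/articles.json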