This repository contains code to deploy Apache Airflow on Heroku. In Airflow, multiple jobs (DAGs) scrape newspapers using the Python package Newspaper3k and insert the scraped articles into a MongoDB database.
Step 1 (the MongoDB setup) is only required if you wish to run the scraping DAGs. If you prefer to run your own DAGs, start from step 2 (registering a Heroku account).
- OPTIONAL: Create a MongoDB account
  - Create a user with read/write permissions
  - Generate the connection string for this user
  - Add the connection string to heroku_setup.sh in this line:
      heroku config:set MONGO_DB="HERE ADD YOUR MONGO DB CONNECTION STRING"
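Heroku exposes config vars to the running app as environment variables, so a DAG can pick up the connection string at run time. The following is a minimal sketch of how that lookup might look; the helper name `get_mongo_uri` is hypothetical, and reading `MONGO_DB` through `os.environ` is an assumption about how the scraping DAGs consume the value, not a description of the code in this repository.

```python
import os
from urllib.parse import urlparse


def get_mongo_uri(env=None):
    """Return the MongoDB connection string stored in MONGO_DB.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    uri = env.get("MONGO_DB", "").strip()
    if not uri:
        raise RuntimeError(
            "MONGO_DB is not set; run `heroku config:set MONGO_DB=...` first"
        )
    # MongoDB connection strings start with mongodb:// or mongodb+srv://
    scheme = urlparse(uri).scheme
    if scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError(f"unexpected scheme {scheme!r} in MONGO_DB")
    return uri
```

Checking the scheme early catches the case where the placeholder text in heroku_setup.sh was never replaced with a real connection string.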
- Register an account on https://www.heroku.com/
- Log in to Heroku via the terminal
    heroku login
- Configure and deploy Airflow
    bash heroku_setup.sh
- Open the deployed app in your browser
    heroku open
- Change the user password
- Implement your DAGs in the dags folder
- Push your changes to master
    git push heroku master
  or, to deploy from another branch:
    git push heroku subbranchname:master
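For readers writing their own DAGs, here is a rough sketch of the kind of task logic a scraping DAG in the dags folder might contain: parse an article with Newspaper3k, turn it into a document, and insert it into a MongoDB collection. The function names (`scrape_article`, `to_document`, `store_documents`) and the document fields are hypothetical and not taken from this repository. The `collection` argument is assumed to be a PyMongo collection, or any object with an `insert_many` method.

```python
from datetime import datetime, timezone


def to_document(url, title, text):
    """Build the dict that would be stored for one article (fields are illustrative)."""
    return {
        "url": url,
        "title": title.strip(),
        "text": text.strip(),
        "scraped_at": datetime.now(timezone.utc),
    }


def scrape_article(url):
    """Download and parse one article with Newspaper3k."""
    # Imported inside the function so this module still loads
    # when newspaper3k is not installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return to_document(url, article.title, article.text)


def store_documents(collection, documents):
    """Insert all documents that have non-empty text; return how many were stored."""
    docs = [d for d in documents if d["text"]]
    if docs:  # insert_many rejects an empty list in PyMongo
        collection.insert_many(docs)
    return len(docs)
```

A function that calls `scrape_article` and `store_documents` could then be passed as the `python_callable` of an Airflow `PythonOperator`; the exact import path for that operator depends on the Airflow version deployed.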
As always, we did not reinvent the wheel but benefited from multiple sources, of which we recall the following:
- Setting up Airflow on Heroku: https://medium.com/@damesavram/running-airflow-on-heroku-ed1d28f8013d