- Clone repo
git clone https://github.com/YazidKHALDI/bbc-covid-news-analysis.git
- Run
cd bbc-covid-news-analysis
- Run
python3.9 -m venv python_env; . python_env/bin/activate
- Run
mkdir -p ./logs ./plugins
- Run
pip install "apache-airflow[celery]==2.5.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.9.txt"
(change 3.9 to your Python version) - Run
cd api
and run
pip install -r requirements.txt
- Run
cd ..
and run
echo -e "AIRFLOW_UID=$(id -u)" > .env
(only on first use) - Run
cd docker_build
and run
docker compose build
- Run
cd ..
and run
docker-compose up airflow-init
(only on first use) - Run
docker compose up -d
then access Airflow at http://localhost:8080 and the MongoDB GUI at http://localhost:4321/ - Run
cd api
and run
uvicorn app:app --reload
then query the API at http://localhost:8000/ to get all articles. - Run
cd app
and run
npm i
then
npm run build
- Run
node server.js
then access the web application at http://localhost:6000/
To scrape BBC's news articles with Airflow, I decided to run Airflow in Docker, and I used a Dockerfile to extend the base image and install all the requirements needed for scraping.
FROM apache/airflow:2.5.1-python3.9
COPY requirements.txt .
RUN pip install --user --upgrade pip
RUN pip install --no-cache-dir --user -r requirements.txt
The script scraper/main.py is mounted into the containers via the docker-compose volume
- ${AIRFLOW_PROJ_DIR:-.}/scraper:/opt/airflow/scraper
so we can run the scraping script from Airflow DAGs.
The scraper code has two parts: scraping the home page articles with scrap_topos_component(), and the latest news articles with scrap_latest_updates(), both taking a start date and an end date as parameters.
def scrap_corna_news_by_date(start_date, end_date):
    print("Start scraping data\nfrom\t" + str(datetime.fromtimestamp(start_date).strftime("%Y-%m-%d %H:%M:%S")) + "\nto\t" + str(datetime.fromtimestamp(end_date).strftime("%Y-%m-%d %H:%M:%S")))
    print("step 1 : scraping of top articles")
    scrap_topos_component(start_date, end_date)
    print("step 1 Done")
    print("step 2 : scraping of Latest Updates articles")
    scrap_latest_updates(start_date, end_date)
    print("step 2 Done")
The goal here is to get the URL of each article and then scrape its content. We have two types of URLs: those from the first section of the home page and those from the Latest Updates section. For the first section, I scraped the HTML to get the URLs that fall within the date range.
def get_topos_component_urls_by_date(start_date=None, end_date=None):
    url = "https://www.bbc.com/news/coronavirus"
    response = requests.get(url)
    if response.status_code == 404:
        exit('Page cannot be found')
    soup = BeautifulSoup(response.content, "html5lib")
    return {
        url_data['url']: url_data
        for url_data in [
            {"url": base_url + str(link.get('href')), "date": get_lisnk_date(link)}
            for link in soup.find('div', id="topos-component").findAll('a')
            if "gs-c-section-link--truncate" not in link['class']
            if get_lisnk_date(link) is not None and (start_date <= get_lisnk_date(link) <= end_date)
        ]
    }.values()
For the Latest Updates section, I found an API URL that returns the paginated list of articles (for example, page 6):
def scrap_latest_updates(start_date=None, end_date=None):
    ...
    response = requests.get("https://push.api.bbc....")
    if response.status_code == 404:
        continue
    res = response.json()
    ...
    if ("url" in article) and article['type'] == 'STY':
        article_url = base_url + str(article['url'])
    ...
Then we come to the main function, scrap_article_body_by_url(url), which accepts a URL and returns a dict. To scrape the data, I found that this JavaScript variable contains all the information about the article as unparsed JSON:
<script nonce="">
window.__INITIAL_DATA__="{\"data\":{\"chameleon-global-navigation?country=ma&language=en-GB\":{\"name........}}";
</script>
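As an illustration only, and not necessarily the repo's exact code, extracting and parsing that variable could look like the hypothetical helper below (extract_initial_data is my name for it):

import json
from bs4 import BeautifulSoup

def extract_initial_data(html):
    # Find the script tag that assigns window.__INITIAL_DATA__ and parse its payload.
    soup = BeautifulSoup(html, "html5lib")
    for script in soup.find_all("script"):
        text = script.get_text()
        if "window.__INITIAL_DATA__" in text:
            # The payload is a JSON string assigned to the variable, so it is decoded twice:
            # once to unescape the string literal, then again to parse the object itself.
            raw = text.split("window.__INITIAL_DATA__=", 1)[1].strip().rstrip(";")
            return json.loads(json.loads(raw))
    return None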
The result of the function is this object:
{
    "menu": article_menu,
    "submenu": article_sub_menu,
    "title": article_title,
    "date": article_date,
    "date_timestamp": article_date_fromtimestamp,
    "main_image": article_main_img,
    "authors": article_authors,
    "topics": article_topics,
    "article_images": article_images,
    "text": article_text,
    "url": url,
    "countries": countries if countries else None,
    "subjectivity": subjectivity,
    "polarity": polarity
}
I extract the list of countries using GeoText and the sentiment of the article using TextBlob.
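As a minimal sketch of how these two libraries are typically used for this (the article text below is made up for illustration):

from geotext import GeoText
from textblob import TextBlob

article_text = "China and Singapore reported new coronavirus cases this week."

countries = list(set(GeoText(article_text).countries))   # e.g. ['China', 'Singapore']
sentiment = TextBlob(article_text).sentiment
polarity = sentiment.polarity          # -1.0 (negative) .. 1.0 (positive)
subjectivity = sentiment.subjectivity  # 0.0 (objective) .. 1.0 (subjective)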
The final step is to insert the data into MongoDB, a service I added to docker-compose.yaml. To communicate with the DB we use pymongo:
def get_db():
    CONNECTION_STRING = "mongodb://root:password@mongodb:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.2"
    client = MongoClient(CONNECTION_STRING)
    return client.bcc_articles
After extracting an article, we insert the data into the collection:
def insert_articles_to_db(article_record):
    if article_record is not None:
        db = get_db()
        collection = db.articles
        collection.insert_one(article_record)
        global counter
        counter = counter + 1
        print("[Done]date: " + str(article_record['date']) + "\ttitle : " + str(article_record['title']) + "\turl : " + str(article_record['url']) + "\n")
To automate the scraping of articles we create two DAGs. The first is scheduled @once and scrapes all coronavirus news from the last 100 days.
with DAG(
    "scrapignlast_last_100_day_news",
    default_args={
        ...
    },
    description="Scraping new articles every day for 5 days continuously",
    schedule="@once",
    start_date=datetime.now(),
) as dag:
    t1 = BashOperator(
        task_id="scrapignlast_news",
        bash_command="python /opt/airflow/scraper/main.py"
    )
The second DAG is scheduled @daily and scrapes the articles of the last day:
with DAG(
    "scrapignlast_news_daily",
    default_args={
        ...
    },
    ...
    schedule="@daily",
    start_date=datetime.now(),
    ...
) as dag:
    t1 = BashOperator(
        task_id="scrapignlast_news_daily",
        bash_command="python /opt/airflow/scraper/main.py --daily"
    )
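For context, here is a hypothetical sketch of how scraper/main.py could map the --daily flag to a date range before calling scrap_corna_news_by_date; the real main.py may handle its arguments differently:

import sys
from datetime import datetime, timedelta

# Assumes scrap_corna_news_by_date (shown earlier) is defined or imported in this module.
if __name__ == "__main__":
    now = datetime.now()
    if "--daily" in sys.argv:
        start = now - timedelta(days=1)     # daily DAG: scrape the last 24 hours
    else:
        start = now - timedelta(days=100)   # @once DAG: scrape the last 100 days
    scrap_corna_news_by_date(int(start.timestamp()), int(now.timestamp()))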
Then we need to activate the two DAGs: the first executes once to store all the articles of the last ~3 months in the DB, and after that the second DAG executes daily.
You can check the scraping logs of each DAG in the Airflow UI, and you can access the MongoDB GUI at http://localhost:4321/.
After the data is collected, I created a small API using FastAPI. The goal is to get all articles:
http://localhost:8000/
or the articles of a specific date range, for example from 01/01/2023 to 01/02/2023:
http://localhost:8000/?start_date=2023-01-01&end_date=2023-02-01
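A minimal sketch of what such an endpoint could look like, assuming a flat articles collection with a comparable date field (the actual api/app.py may differ):

from typing import Optional

from fastapi import FastAPI
from pymongo import MongoClient

app = FastAPI()

def get_db():
    # Hypothetical local connection string; inside Docker the host would be the mongodb service.
    return MongoClient("mongodb://root:password@localhost:27017/").bcc_articles

@app.get("/")
def read_articles(start_date: Optional[str] = None, end_date: Optional[str] = None):
    query = {}
    if start_date and end_date:
        # Assumes the stored "date" field compares correctly as an ISO string; adapt to the real schema.
        query["date"] = {"$gte": start_date, "$lte": end_date}
    articles = list(get_db().articles.find(query, {"_id": 0}))
    return {"articles": articles}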
The result is a JSON object like this:
{
"articles": [],
"frequency_of_topics": {
"Xi Jinping": 5,
"China": 67,
"Coronavirus pandemic": 265,
},
"frequency_of_authors": {
"Stephen McDonell": 4,
"Tessa Wong and Nathan Williams": 1,
"Katie Razzall": 2,
"Tessa Wong": 2,
},
"frequency_of_countries": {
"China": 81,
"Singapore": 8,
"Australia": 11,
"Hong Kong": 21,
},
"author_by_topics_subjectivity_polarity": {
"Lola Mayor": {
"topics": [
"Brecon",
"Cardiff",
"Wrexham",
"Brecon Beacons",
"Canada",
"Social media",
"Welsh government",
"TikTok",
"Social media influencers"
],
"subjectivity": 0.6215147392290248,
"polarity": 0.32200027485741783,
"counter": 1
}
},
"count_articls_per_day": [
{
"date": 1669935600,
"count": 4
},
{
"date": 1670022000,
"count": 2
},
{
"date": 1670108400,
"count": 1
}
...
],
"unique_list_of_topics_relations": [
[
"Coronavirus lockdown measures",
"Aldridge",
1
],
[
"Charities",
"Scottish government",
1
],
[
"Economic growth",
"Food",
1
]..
]
}
"articles"
: it's a simple list of articles from DB"frequency_of_topics"
: Count number of topics in articles"frequency_of_authors"
: Count number of authors in articles"frequency_of_countries"
: Count number of countries in articles"author_by_topics_subjectivity_polarity"
: Thetopics
of every author and the average ofsubjectivity
andpolarity
"count_articls_per_day"
: Number of articles published per day, the publish date represented astimestamp
."unique_list_of_topics_relations"
: List of the links between all topics."start_date_data"
: Lowest date between published news articles."end_date_data"
: Highest date between published news articles.
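As an illustration of the kind of aggregation behind a field like "frequency_of_topics" (not necessarily the project's exact implementation):

from collections import Counter

def frequency_of_topics(articles):
    # Count how many articles mention each topic.
    counter = Counter()
    for article in articles:
        counter.update(article.get("topics") or [])
    return dict(counter)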
For the visualization, I decided to use Highcharts.
When you visit http://localhost:6060/ without specifying a date range, you will see information about all the articles in the DB; you can then select the date range you need using the date picker.
The first chart is a visualization of the frequency of published articles, with the possibility of zooming in on a specific period.
The second part is about Authors: a simple chart showing the number of articles written per author, and a sentiment analysis of the news articles per author, showing the average polarity and subjectivity.
The third part is about Topics: a simple chart showing how often each topic appears in the articles, and a word cloud of the most used topics.
I added a dependency wheel to visualize the relationships between topics used in the same article, but the dependency wheel is shown only if there are fewer than 1000 links between topics, because the chart is hard to read when there are too many relationships.
The fourth part is about Countries: a simple chart showing how often each country appears in the articles, and a map visualization.