*** We will never ask for money to share the datasets. If someone claims that s/he has the all the raw data and wants a payment, please be careful. ***
We released a tool FakeNewsTracker, for collecting, analyzing, and visualizing of fake news and the related dissemination on social media. Check it out!
The latest dataset paper with detailed analysis on the dataset can be found at FakeNewsNet
Please use the current up-to-date version of dataset
Previous version of the dataset is available in branch named old-version
of this repository.
Complete dataset cannot be distributed because of Twitter privacy policies and news publisher copy rights. Social engagements and user information are not disclosed because of Twitter Policy. This code repository can be used to download news articles from published websites and relevant social media data from Twitter.
The minimalistic version of latest dataset provided in this repo (located in dataset
folder) include following files:
politifact_fake.csv
- Samples related to fake news collected from PolitiFactpolitifact_real.csv
- Samples related to real news collected from PolitiFactgossipcop_fake.csv
- Samples related to fake news collected from GossipCopgossipcop_real.csv
- Samples related to real news collected from GossipCop
Each of the above CSV files is comma separated file and have the following columns
id
- Unique identifider for each newsurl
- Url of the article from web that published that newstitle
- Title of the news articletweet_ids
- Tweet ids of tweets sharing the news. This field is list of tweet ids separated by tab.
Data download scripts are writtern in python and requires python 3.6 +
to run.
Twitter API keys are used for collecting data from Twitter. Make use of the following link to get Twitter API keys
https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html
Script make use of keys from tweet_keys_file.json file located in code/resources
folder. So the API keys needs to be updated in tweet_keys_file.json
file. Provide the keys as array of JSON object with attributes app_key,app_secret,oauth_token,oauth_token_secret
as mentioned in sample file.
Install all the libraries in requirements.txt
using the following command
pip install -r requirements.txt
FakeNewsNet contains 2 datasets collected using ground truths from Politifact and Gossipcop.
The config.json
can be used to configure and collect only certain parts of the dataset. Following attributes can be configured
-
num_process - (default: 4) This attribute indicates the number of parallel processes used to collect data.
-
tweet_keys_file - Provide the number of keys available configured in tweet_keys_file.txt file
-
data_collection_choice - It is an array of choices of various parts of the dataset. Configure accordingly to download only certain parts of the dataset.
Available values are
{"news_source": "politifact", "label": "fake"},{"news_source": "politifact", "label": "real"}, {"news_source": "gossipcop", "label": "fake"},{"news_source": "gossipcop", "label": "real"} -
data_features_to_collect - FakeNewsNet has multiple dimensions of data (News + Social). This configuration allows one to download desired dimension of the dataset. This is an array field and can take following values.
- news_articles : This option downloads the news articles for the dataset.
- tweets : This option downloads tweets objects posted sharing the news in Twitter. This makes use of Twitter API to download tweets.
- retweets: This option allows to download the retweets of the tweets provided in the dataset.
- user_profile: This option allows to download the user profile information of the users involved in tweets.
- user_timeline_tweets: This option allows to download upto 200 recent tweets from the user timeline. To download user's recent tweets, tweet objects needs to be downloaded first in order to identify users involved in tweets.
- user_followers: This option allows to download the user followers ids of the users involved in tweets. To download user followers ids, tweet objects need to be downloaded first in order to identify users involved in tweets.
- user_following: This option allows to download the user following ids of the users involved in tweets. To download user's following ids, tweet objects needs to be downloaded first in order to identify users involved in tweets.
Inorder to collect data set fast, code makes user of process parallelism and to synchronize twitter key limitations across mutiple python processes, a lightweight flask application is used as keys management server.
Execute the following commands inside code
folder,To download user profiles, tweet objects need to be downloaded first in order to identify users involved in tweets.
nohup python -m resource_server.app &> keys_server.out&
The above command will start the flask server in port 5000 by default.
Configurations should be done before proceeding to the next step !!
Execute the following command to start data collection,
nohup python main.py &> data_collection.out&
Logs are wittern in the same folder in a file named as data_collection_<timestamp>.log
and can be used for debugging purposes.
The dataset will be downloaded in the directory provided in the config.json
and progress can be monitored in data_collection.out
file.
The downloaded dataset will have the following folder structure,
├── gossipcop
│ ├── fake
│ │ ├── gossipcop-1
│ │ │ ├── news content.json
│ │ │ ├── tweets
│ │ │ │ ├── 886941526458347521.json
│ │ │ │ ├── 887096424105627648.json
│ │ │ │ └── ....
│ │ │ └── retweets
│ │ │ ├── 887096424105627648.json
│ │ │ ├── 887096424105627648.json
│ │ │ └── ....
│ │ └── ....
│ └── real
│ ├── gossipcop-1
│ │ ├── news content.json
│ │ ├── tweets
│ │ └── retweets
│ └── ....
├── politifact
│ ├── fake
│ │ ├── politifact-1
│ │ │ ├── news content.json
│ │ │ ├── tweets
│ │ │ └── retweets
│ │ └── ....
│ │
│ └── real
│ ├── poliifact-2
│ │ ├── news content.json
│ │ ├── tweets
│ │ └── retweets
│ └── ....
├── user_profiles
│ ├── 374136824.json
│ ├── 937649414600101889.json
│ └── ....
├── user_timeline_tweets
│ ├── 374136824.json
│ ├── 937649414600101889.json
│ └── ....
└── user_followers
│ ├── 374136824.json
│ ├── 937649414600101889.json
│ └── ....
└──user_following
├── 374136824.json
├── 937649414600101889.json
└── ....
News Content
news content.json
:
This json includes all the meta information of the news articles collected using the provided news source URLs. This is a JSON object with attributes including:
text
is the text of the body of the news article.images
is a list of the URLs of all the images in the news article web page.publish date
indicate the date that news article is published.
Social Context
tweets
folder:
This folder contains all tweets related to the news sample. This contains the tweet objects of the all the tweet ids provided in the tweet_ids attribute of the dataset csv. All the files in this folder are named as <tweet_id>.json
. Each <tweet_id>.json
file is a JSON file with format mentioned in https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html.
retweets
folder:
This folder contains the retweets of the all tweets posted sharing a particular news article. This folder contains files named as <tweet_id>.json
and it contains a array of the retweets for a particular tweets. Each object int the retweet array have format mentioned in https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-retweets-id.
user_profiles
folder:
This folder contains all the user profiles of the users posting tweets related to all news articles. This same folder is used for both datasources ( Politifact and GossipCop). It contains files named as <user_id>.json
and have JSON formated mentioned in https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.html
user_timeline_tweets
folder:
This folder contains files representing the time line of tweets of users posting tweets related to fake and real news. All files in the folder are named as <user_id>.json
and have JSON array of upto 200 recent tweets of the users. The files have format mentioned same as https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html.
user_followers
folder:
This folder contains all the user followers ids of the users posting tweets related to all news articles. This same folder is used for both datasources ( Politifact and GossipCop). It contains files named as <user_id>.json
and have JSON data with user_id
and followers
attributes.
user_following
folder:
This folder contains all the user following ids of the users posting tweets related to all news articles. This same folder is used for both datasources ( Politifact and GossipCop). It contains files named as <user_id>.json
and have JSON data with user_id
and following
attributes.
If you use this dataset, please cite the following papers:
@article{shu2018fakenewsnet,
title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
journal={arXiv preprint arXiv:1809.01286},
year={2018}
}
@article{shu2017fake,
title={Fake News Detection on Social Media: A Data Mining Perspective},
author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
journal={ACM SIGKDD Explorations Newsletter},
volume={19},
number={1},
pages={22--36},
year={2017},
publisher={ACM}
}
@article{shu2017exploiting,
title={Exploiting Tri-Relationship for Fake News Detection},
author={Shu, Kai and Wang, Suhang and Liu, Huan},
journal={arXiv preprint arXiv:1712.07709},
year={2017}
}
(C) 2019 Arizona Board of Regents on Behalf of ASU