Welcome! This is the repository for the semester assignment in "Collecting and Analyzing Big Data" at KU Leuven (Academic Year 2020-2021).
In the assignment, we wrote a short research paper investigating the interrelation between the Bitcoin price (BTC) and thread activity on the subreddit r/Bitcoin. In particular, we wanted to investigate what impact the price of Bitcoin has on thread activity and on the content of the thread texts. The code in this repository summarizes our work and analyses.
- r/Bitcoin: https://www.reddit.com/r/Bitcoin/
- Coindesk API: https://www.coindesk.com/coindesk-api
- Coindesk documentation: https://pypi.org/project/coindesk/
- Pushshift API: https://pushshift.io, https://pushshift.io/api-parameters/
- Pushshift repository: https://github.com/pushshift/api
- Pushshift paper: https://ojs.aaai.org/index.php/ICWSM/article/view/7347/7201
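
To give a feel for how thread data can be requested from Pushshift, here is a minimal sketch that only builds a query URL and does not send it. The endpoint path and the helper name `build_submission_query` are assumptions for illustration and are not taken from the notebooks; check the Pushshift documentation linked above, since the service and its parameters may have changed.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Pushshift submission search (not confirmed by this
# repository); verify it against the Pushshift documentation before use.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission/"


def build_submission_query(subreddit, after, before, size=100):
    """Return a search URL for submissions posted in one time window.

    `after` and `before` are Unix timestamps in seconds.
    """
    params = {
        "subreddit": subreddit,
        "after": after,
        "before": before,
        "size": size,
    }
    return PUSHSHIFT_SUBMISSIONS + "?" + urlencode(params)


# Example window: 1 January 2021, 00:00 UTC, to 2 January 2021, 00:00 UTC.
url = build_submission_query("Bitcoin", after=1609459200, before=1609545600)
print(url)
```

Keeping URL construction separate from the actual request makes it easy to inspect a query before sending it and to loop over consecutive time windows.
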
In this analysis, we employ several methods, some self-taught and some taught in the class lectures. Examples include:
- Predictive Modeling
- (Un-)Supervised Learning
- Working with Data in different formats (CSV, JSON)
- Working with APIs
- Text Mining
- Topic Modeling with Latent Dirichlet Allocation
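
As a small illustration of working with data in different formats, the sketch below converts JSON records into CSV using only the Python standard library. The sample records, the field names (`id`, `created_utc`, `num_comments`, `title`), and the helper `json_to_csv` are hypothetical; the actual notebooks may use different columns and tools.

```python
import csv
import io
import json

# Hypothetical sample data shaped like thread records; not real data.
raw = """[
  {"id": "a1", "title": "BTC hits new high", "num_comments": 42, "created_utc": 1609459200},
  {"id": "b2", "title": "Wallet question", "num_comments": 3, "created_utc": 1609462800}
]"""

# Assumed column selection, chosen for illustration only.
FIELDS = ["id", "created_utc", "num_comments", "title"]


def json_to_csv(json_text, fields):
    """Convert a JSON array of objects into CSV text with the given columns."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: record.get(name, "") for name in fields})
    return buffer.getvalue()


csv_text = json_to_csv(raw, FIELDS)
print(csv_text)
```
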
This repository contains several files, briefly explained here:
- Data Retrieval.ipynb: A Jupyter notebook for data retrieval, including the Bitcoin Price Index via Coindesk (powered by Coindesk: https://www.coindesk.com/price/bitcoin). It produces two datasets:
  - bpi.csv: A CSV file containing the Bitcoin Price Index.
  - df_final.zip: A zip archive containing df_final.csv, a CSV file containing data from Reddit.
- Exploratory Data Analysis.ipynb: A Jupyter notebook for exploratory data analysis and a few visualizations.
- Volatility.ipynb: A Jupyter notebook for volatility analysis.
- Text_Analysis.ipynb: A Jupyter notebook for text analysis of the thread texts, organized into several chapters.
- text-analysis: A folder with several .py files that contain essentially the same code as Text_Analysis.ipynb, provided as .py files so that it can be executed from the command line.
- The remaining folders and files contain outputs, helper files, etc.
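
Since df_final.csv is shipped inside df_final.zip, it can be read without unpacking the archive first. The following is a minimal sketch using only the standard library; the helper name `read_csv_from_zip` and the demo columns are hypothetical, and the demo builds a tiny in-memory archive so that it runs without the real data.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member):
    """Return the rows of one CSV file inside a zip archive as dictionaries.

    `zip_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # archive.open yields bytes; wrap the stream to decode it as text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Demo with a tiny in-memory archive; the columns here are made up.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("df_final.csv", "id,title\na1,Hello\n")
demo.seek(0)

rows = read_csv_from_zip(demo, "df_final.csv")
print(rows)
```

With the real archive in the working directory, the call would be `read_csv_from_zip("df_final.zip", "df_final.csv")`. Note that this loads all rows into memory, which may be slow for a large file.
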