The main purpose of this project is to help scientists around the world estimate and analyze public opinion from social-media comments. The idea is to vectorize comments with text-embedding algorithms and then use the resulting vectors for clustering, classification, analysis of dynamics over time, and similarity comparison with a reference text. The approach is tested on data from the Reddit platform.
| Folder | Description |
|---|---|
| preprocessing | preprocessing data downloaded from Reddit |
| webapp | streamlit web_app |
| sBert | testing sBert: vectorization, classification, cos_sim, clustering |
| doc2vec | testing doc2vec: vectorization, classification, cos_sim |
| USE | testing USE: vectorization, classification, cos_sim |
Run the Streamlit app in the webapp folder.
- Install the streamlit library.
- Set up the environment. The required libraries are listed in requirements.txt.
- Put three tables ("df_doc2vec", "df_sbert", "df_use"), saved in pickle format, in the same folder as the app.
  Main column names: body, vec, who.
  The 'body' column holds the comments (type string), 'vec' holds the embeddings, and 'who' holds the label biden (1) or trump (0) (type int).
- Run the Streamlit app:
streamlit run
more info here
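
The table layout described in the steps above can be sketched as follows. This is a minimal illustration, assuming pandas is installed. The sample comments, the three-dimensional vectors, and the helper name `make_table` are hypothetical placeholders, and the `.pkl` file name is an assumption, since the exact file names the app expects are not stated here.

```python
import os
import tempfile

import pandas as pd


def make_table(bodies, vectors, labels):
    """Build a table with the three main columns: body, vec, who."""
    return pd.DataFrame({
        "body": bodies,   # comment text (str)
        "vec": vectors,   # embedding of each comment (list of floats)
        "who": labels,    # 1 = biden, 0 = trump (int)
    })


# Hypothetical toy data; real vectors would come from sBert, doc2vec, or USE.
df_sbert = make_table(
    ["first example comment", "second example comment"],
    [[0.1, 0.2, 0.3], [0.3, 0.1, 0.0]],
    [1, 0],
)

# Save in pickle format and read it back (the file name is an assumption).
path = os.path.join(tempfile.gettempdir(), "df_sbert.pkl")
df_sbert.to_pickle(path)
loaded = pd.read_pickle(path)
print(loaded.dtypes)
```

Storing each embedding as one cell in the `vec` column keeps a comment, its vector, and its label on the same row, which is why pickle (rather than CSV) is a natural fit for these tables.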
Results for sBert, doc2vec, USE
- similarity to reference text: 'We should build the wall!'
- classification to the right party
- clusters
- dynamics
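
The similarity-to-reference-text result relies on the cosine similarity (the cos_sim step listed for each model) between a comment embedding and the embedding of the reference text. Below is a minimal sketch in plain Python, assuming both texts have already been vectorized. The three-dimensional vectors and the comment names are hypothetical stand-ins, not actual model output, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical embedding of the reference text.
reference_vec = [0.2, 0.9, 0.1]

# Hypothetical embeddings of two comments.
comment_vecs = {
    "comment A": [0.1, 0.8, 0.2],
    "comment B": [0.9, -0.1, 0.3],
}

# Rank comments from most to least similar to the reference text.
ranked = sorted(
    comment_vecs,
    key=lambda name: cos_sim(comment_vecs[name], reference_vec),
    reverse=True,
)
print(ranked)
```

Cosine similarity compares only the direction of two vectors, so comments of different lengths can be ranked against the same reference text on a common scale from -1 to 1.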