NLP Unsupervised ML project
The goal of this project is to build a recommendation system that helps scientists and researchers navigate the current surge of papers about COVID-19, find what is relevant to their work, and uncover hidden semantic relationships. Using the COVID-19 Open Research Dataset, I took the abstracts of the articles published from January 2020 to May 2021 (about 260,000 articles) as the text corpus. With an LDA model, I assigned each document its dominant topic and its relevance to that topic, then grouped the articles by topic for the recommendation system, so researchers can look up articles by the topic related to their work. Lastly, I deployed a Streamlit app on Heroku, using a smaller dataset, that recommends the top 20 related articles for a selected topic.
To learn more, see my blog post and presentation slides.
The topic model visualization built with pyLDAvis is saved as an HTML file; you can download it from here to view it.
Try out the Heroku app for COVIPEDIA~
- Code (in Workflow Folder)
- Streamlit app on Heroku
- main python file
- Procfile, setup doc, required library for Heroku
- Dataset used in app
- Streamlit app on Heroku
- Python (pandas, numpy)
- langdetect
- regex, string
- spaCy, scispaCy ("en_core_sci_lg" model for biomedical, scientific, and clinical vocabulary)
- NLTK
- scikit-learn
- Gensim
- WordCloud
- pyLDAvis
- Streamlit
- Heroku