Topic modelling on JRE podcast transcripts, with recommendations made using cosine similarity. The project is deployed on Heroku at https://jre-recommender.herokuapp.com/.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Required packages: BeautifulSoup, pymongo, sklearn, pandas, nltk, SpaCy, gensim, sqlalchemy, psycopg2
The data comes from podgist and podscribe. Check their robots.txt before scraping; at the time of my scraping, it was not banned. The scraping code is in 01_scrape_transcripts.
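If you want to check robots.txt rules programmatically rather than by eye, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the example rules, and the example.com URLs are all hypothetical and are not part of this project or of either site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not the real sites' rules).
example_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/transcripts/1"))
print(is_allowed(example_rules, "https://example.com/private/page"))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before you call `can_fetch()`.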
Set up your credentials in a new file named cred.py in the credentials folder, in the form:
mongo_user = 'your username'
mongo_pass = 'your password'
mongo_host = 'your server:port number'
mongo_auth_db = 'authorizing DB for your user'
sql_host = 'your postgres server'
sql_port = your postgres port
sql_user = 'postgres username'
sql_pass = 'postgres password'
heroku_DB = 'postgres DB from heroku, take a look at your heroku settings'
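As a rough illustration of how values like these are commonly turned into connection strings, the sketch below builds a MongoDB URI and a SQLAlchemy-style PostgreSQL URL. The helper names `mongo_uri` and `postgres_url` are hypothetical, and this is an assumption about typical usage, not necessarily how the project's notebooks consume cred.py.

```python
from urllib.parse import quote_plus


def mongo_uri(user: str, password: str, host: str, auth_db: str) -> str:
    """Build a MongoDB connection URI; host is expected as 'server:port'."""
    # quote_plus escapes characters such as '@' or ':' in credentials.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?authSource={quote_plus(auth_db)}"
    )


def postgres_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a SQLAlchemy URL that selects the psycopg2 driver."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
print(mongo_uri("me", "p@ss", "localhost:27017", "admin"))
print(postgres_url("postgres", "pw", "localhost", 5432, "jre"))
```

Strings of this shape can be passed to `pymongo.MongoClient(...)` and `sqlalchemy.create_engine(...)` respectively. Keep cred.py out of version control.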
Data is read from a MongoDB database and cleaned in 02_clean_transcripts.
Topics are modelled in 04_topic_modelling.
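To show how cosine similarity can turn per-episode topic weights into recommendations, here is a minimal sketch in plain Python. The episode names, the three-topic vectors, and the `recommend` helper are made up for illustration; the project's actual topic model and vector dimensions are not shown in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(episode, topic_vectors, top_n=3):
    """Return the top_n episodes most similar to the given one."""
    query = topic_vectors[episode]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in topic_vectors.items()
        if other != episode  # never recommend the episode itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scores[:top_n]]


# Hypothetical topic distributions (one weight per topic).
topic_vectors = {
    "episode_a": [0.9, 0.1, 0.0],
    "episode_b": [0.8, 0.2, 0.0],
    "episode_c": [0.0, 0.1, 0.9],
}

print(recommend("episode_a", topic_vectors, top_n=1))
```

For larger collections, `sklearn.metrics.pairwise.cosine_similarity` computes the full similarity matrix in one call.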
- Eric Bassett - Initial work - github