maudlin

Maudlin is a news aggregator, sentiment analyzer, and topic tracker (so far!). It scrapes the front pages of more than a hundred news sites and analyzes the headlines for insights into events, the election, and the way the news is talking about them.

The current app is available at https://maudlin.news.

Basic Premise

This project is a successor to the first Maudlin (now defunct), which can be found here: https://github.com/mas-4/maudlin. Its problems were manifold, chief among them that it was a web app! It generated pages on access instead of generating a static site, which this new version does. It also used Flask and Scrapy, whereas this new one is 90% rolled by me (tech stack below).

The biggest change from the earlier version, besides dropping Flask, is that we no longer scrape articles. We're headlines only. This makes maintaining 114 different scrapers much less difficult. It also limits the amount of data I need to store.

I am using www.mediabiasfactcheck.com to get a partisan score and a factuality score for each source, which will allow me to do some more interesting metrics. And I'm trying my best to remove useless words.

Tech Stack

I'm using requests, selenium with headless Firefox, and BeautifulSoup to do all my scraping. I basically built my own spider framework, and I think it's shockingly clean! It runs all the spiders at once in a multithreaded environment. As of the last time I updated this README, the scrape and build is kicked off every half hour.
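To give a feel for the "all spiders at once" idea, here is a minimal sketch using only the standard library. It is not the actual Maudlin framework: the `Spider`, `FakeSpider`, and `run_all` names are hypothetical, and the fake spiders serve canned text so the example runs offline (a real one would call requests or headless Firefox in `fetch` and BeautifulSoup in `parse`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Spider:
    """Hypothetical base class: one subclass per news site."""

    name = "base"

    def fetch(self) -> str:
        # A real spider would download the front page here.
        raise NotImplementedError

    def parse(self, html: str) -> list[str]:
        # A real spider would extract headline text from the markup here.
        raise NotImplementedError

    def run(self) -> list[str]:
        return self.parse(self.fetch())


class FakeSpider(Spider):
    """Stand-in spider that serves canned text so the sketch runs offline."""

    def __init__(self, name: str, headlines: list[str]):
        self.name = name
        self._headlines = headlines

    def fetch(self) -> str:
        return "\n".join(self._headlines)

    def parse(self, html: str) -> list[str]:
        return [line.strip() for line in html.splitlines() if line.strip()]


def run_all(spiders: list[Spider], max_workers: int = 8) -> dict[str, list[str]]:
    """Run every spider on a thread pool; a failing spider yields an empty list."""
    results: dict[str, list[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(spider.run): spider for spider in spiders}
        for future in as_completed(futures):
            spider = futures[future]
            try:
                results[spider.name] = future.result()
            except Exception:
                # One broken site should not take down the whole scrape.
                results[spider.name] = []
    return results


spiders = [
    FakeSpider("site_a", ["Headline one", "Headline two"]),
    FakeSpider("site_b", ["Another headline"]),
    Spider(),  # unimplemented on purpose, to show failure isolation
]
scraped = run_all(spiders)
```

Threads fit here because scraping is mostly waiting on the network, and catching exceptions per spider means a site that changes its markup only costs that one site's headlines for the run.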

The data is stored in a SQLite database using SQLAlchemy 2.0+ as an ORM.

I use nltk (with its punkt, vader_lexicon, and averaged_perceptron_tagger resources) and AFINN for all my sentiment analysis.

I've started analyzing articles for handcrafted topics, focusing on topics relevant to the election, like Biden is Old or Trump Trials. This topic analysis does not use LDA or K-Means or any off-the-shelf algorithm; it relies on bags of words and similarity scores.
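As a rough illustration of bags of words plus similarity scores, the sketch below scores a headline against each topic's keyword bag with cosine similarity. The keyword lists, the `best_topic` helper, and the 0.2 threshold are all assumptions made up for this example, not the values Maudlin actually uses.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical handcrafted topics: each one is just a bag of keywords.
TOPICS = {
    "Biden is Old": bag_of_words("biden age old elderly memory gaffe"),
    "Trump Trials": bag_of_words("trump trial court judge jury indictment"),
}


def best_topic(headline: str, threshold: float = 0.2):
    """Return the best-scoring topic, or None if nothing clears the threshold."""
    words = bag_of_words(headline)
    scores = {name: cosine(words, bag) for name, bag in TOPICS.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] >= threshold else None
```

For example, `best_topic("Judge delays Trump trial after jury questions")` picks "Trump Trials", while a headline sharing no keywords with any topic returns `None`.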

I use wordcloud to make the word clouds.

And jinja2 is used for templating.

There's a lot of pandas and numpy in there at this point. Some gensim, I think. And textacy/spacy. Matplotlib and Seaborn, of course. Grid.js for tables. ChatGPT Plus came up with the CSS styling and helped with general debugging.

I've experimented with a lot of different models for topic modeling and story discovery. I finally got story discovery working by using an agglomerative approach with cosine_similarity (sklearn) and strict cluster requirements (at least n samples from different news agencies with cosine similarity scores over 0.5).
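Here is a minimal sketch of that idea, under some loud assumptions: it swaps sklearn's cosine_similarity for a hand-rolled version over word counts so it runs on the standard library alone, it uses single-linkage merging (join any two headlines scoring over the threshold) as one simple agglomerative strategy, and the `find_stories` name, the sample headlines, and `min_agencies=3` are made up for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn a headline into a sparse word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_stories(headlines, min_agencies=3, threshold=0.5):
    """Cluster (agency, title) pairs and keep clusters covered by enough agencies."""
    vectors = [vectorize(title) for _, title in headlines]
    parent = list(range(len(headlines)))  # union-find forest, one node per headline

    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups fast
            i = parent[i]
        return i

    # Single-linkage merge: join any two headlines similar enough to each other.
    for i in range(len(headlines)):
        for j in range(i + 1, len(headlines)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[root(j)] = root(i)

    clusters = {}
    for i, item in enumerate(headlines):
        clusters.setdefault(root(i), []).append(item)

    # Strict requirement: a story needs coverage from enough distinct agencies.
    return [
        members
        for members in clusters.values()
        if len({agency for agency, _ in members}) >= min_agencies
    ]


SAMPLE = [
    ("Agency A", "Senate passes border security bill"),
    ("Agency B", "Senate passes sweeping border bill"),
    ("Agency C", "Border security bill passes Senate vote"),
    ("Agency A", "Storm knocks out power across coast"),
    ("Agency B", "Stocks rally on jobs report"),
]
stories = find_stories(SAMPLE)
```

With the sample data, only the three Senate headlines form a story; the storm and stocks headlines each sit alone and are dropped by the agency-count filter.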

I plan on adding some more sophisticated sentiment scoring and using Hugging Face models for text preprocessing and summarization.

I've actually read the better part of two books in the course of making this thing: Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications, and The Handbook of NLP with Gensim: Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data.

Oh! And the site is hosted on Netlify. It's just a bunch of flat files I upload there.

Testing

Not really sure how to do testing. I guess the site builder could be tested, but meh. I'm a pretty big TDD guy, but it's kind of hard to test scrapers. I had a test suite and abandoned it: sites change, and scrapers have to be updated. I'd rather add features that give me a good sense of what's going wrong. I have a daily report that gets emailed to me every morning, and I keep extensive logs and daily backups. Over one weekend (end of March) my entire machine went down, so I missed a day of articles. But that's what happens when you run this thing out of your garage.

Sites to add

If you know of a good source not on this list please open an issue or a PR! Feel free to make new scrapers!

  • The Nation
  • ProPublica
  • MarketWatch
  • News Nation
  • The Post Millennial
  • TPM
  • Christian Science Monitor
  • Bulwark
  • Dispatch
  • The Daily Caller
  • Gateway Pundit
  • Washington Free Beacon
  • The Washington Times
  • Townhall
  • Washington Examiner
  • Independent Journal Review
  • AlterNet
  • One America News
  • Newsmax
  • ABC News
  • Al Jazeera
  • Associated Press
  • Axios
  • BBC
  • Barron's
  • Bloomberg
  • Breitbart
  • Business Insider
  • CBC
  • CBS News
  • CNBC
  • CNN
  • Caixin Global
  • Chicago Tribune
  • Crooks and Liars
  • Current Affairs
  • Daily Beast
  • Daily Kos
  • Daily Mail
  • Der Spiegel
  • Economist
  • FT
  • Forbes
  • Foreign Affairs
  • Foreign Policy
  • Fortune
  • Fox Business
  • Fox News
  • France 24
  • Global Times
  • Google News
  • Hindustan Times (Chromium)
  • Huffington Post
  • India Times
  • Infowars (403)
  • Jacobin
  • Japan Times
  • Kyiv Independent
  • LA Times
  • Le Monde
  • MSNBC (requires more sophisticated filtering)
  • Military.com
  • Moscow Times
  • Mother Jones
  • NBC
  • NPR
  • National Interest
  • National Post
  • National Review (javascript only)
  • New Republic
  • New York Magazine
  • New York Post
  • New Yorker
  • Newsweek (403)
  • Nikkei Asia
  • PBS News Hour
  • Political Wire
  • Politico (javascript)
  • Punchbowl
  • Quillette
  • RT
  • Radio Free Europe
  • Raw Story
  • Real Clear Politics (Assholes)
  • Reason
  • Red State
  • Reuters (401?)
  • Rolling Stone
  • Salon
  • Scripps
  • Semafor
  • Sky News
  • Slate
  • South China Morning Post
  • Star Tribune
  • Straits Times
  • Sydney Morning Herald
  • Taipei Times
  • Tampa Bay Times
  • Telegraph
  • The Atlantic
  • The Blaze
  • The Daily Wire
  • The Epoch Times
  • The Federalist
  • The Globe and Mail
  • The Guardian
  • The Hill
  • The Independent
  • The Intercept
  • The New York Times
  • The Sun
  • The Times of India
  • The Week
  • Time
  • Toronto Sun
  • USA Today
  • VOA
  • Vanity Fair
  • Vice
  • Vox
  • Wall Street Journal (401/403)
  • Washington Post
  • Winnipeg Free Press
  • Xinhua
  • indianexpress.com
  • livemint.com
  • ndtv.com
  • news.yahoo.com
  • news18.com