A clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world.
The pipeline performs the following main steps:
- Periodically crawl a list of RSS feeds and a subset of Google News to obtain links to news articles.
- Download the articles, taking care not to overload any of the hosting servers.
- Parse each article to obtain:
  - Potential new RSS sources mentioned in the HTML, to be fed back into the crawling step
  - Cleartext version of the article body
- Enrich the articles with semantic annotations (using enrycher.ijs.si).
- Expose the stream of news articles to end users.
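
As a rough illustration of the first two steps, the sketch below pulls article links out of an RSS 2.0 document and computes how long to wait before contacting a host again. It is not taken from this project's code: the names `extract_links` and `HostThrottle` and the 5-second `min_delay` are assumptions made for the example, and the actual pipeline may schedule downloads quite differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def extract_links(rss_xml: str) -> list[str]:
    """Return the <link> URL of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


class HostThrottle:
    """Tracks the earliest time each host may be contacted again.

    Hypothetical helper: min_delay is an assumed politeness interval
    in seconds, not a value used by the real pipeline.
    """

    def __init__(self, min_delay: float = 5.0):
        self.min_delay = min_delay
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before fetching url; also reserves that slot."""
        host = urlsplit(url).netloc.lower()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

A downloader could call `wait_time(url, time.monotonic())` for each extracted link and sleep for the returned number of seconds before issuing the request, so that requests to the same host are spaced at least `min_delay` apart while different hosts are fetched without delay.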
See newsfeed.ijs.si.
BSD-3; see the LICENSE file.