Collects the actual content of any article, blog post, news story, etc. Depending on how a webpage is structured, this project may not produce a perfectly readable result, but it should work well in most cases. It was created to prepare data for NLP work.
pip install readable-content
After installing, extracting content takes just a few lines:
from readable_content.parser import ContentParser

parser = ContentParser("https://ideas.ted.com/how-do-animals-learn-how-to-be-well-animals-through-a-shared-culture/")
content = parser.get_content()
print(content)
If the website does not allow direct access to the content and responds with a 4XX, 3XX, or other error code, you can first fetch the HTML yourself using other techniques (for example, the requests library, a custom user-agent, or your own proxies). The HTML content can then be passed as follows:
parser = ContentParser("https://some_url.com", html_content)
Here, the html_content variable is a string containing the HTML.
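For example, here is a minimal sketch of fetching the HTML yourself with the requests library and a custom User-Agent, then handing it to ContentParser. The URL and header values below are placeholders, not part of this package:

# A minimal sketch: fetch HTML manually, then pass it to ContentParser.
# The URL and User-Agent below are hypothetical examples.
import requests

from readable_content.parser import ContentParser

url = "https://example.com/some-article"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

# Fetching the page ourselves lets us control headers, proxies, retries, etc.
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

# Pass the raw HTML string to ContentParser along with the URL.
parser = ContentParser(url, response.text)
content = parser.get_content()
print(content)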
This project is a Python migration of existing work, with additional features added.
Thank you!