BoilerPy3 is a native Python port of Christian Kohlschütter's Boilerpipe library, released under the Apache 2.0 Licence.
This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and makes use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.
Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3; however, because it performed worse in my tests, I ultimately decided to leave it at 1.2-equivalent.
To install the latest version from PyPI, execute:
pip install boilerpy3
If you'd like to try out any unreleased features you can install directly from GitHub like so:
pip install git+https://github.com/jmriebold/BoilerPy
The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.
from boilerpy3 import extractors
extractor = extractors.ArticleExtractor()
# From a URL
content = extractor.get_content_from_url('http://www.example.com/')
# From a file
content = extractor.get_content_from_file('tests/test.html')
# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
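The three calls above differ only in where the HTML comes from, so they can be wrapped in one helper. The following is a minimal sketch, not part of BoilerPy3: `extract_text` and its dispatch rule (URL prefix first, then existing file path, then raw HTML) are assumptions made for illustration. It only calls the three get_content() methods shown above on whatever extractor object it is given.

```python
import os


def extract_text(extractor, source):
    """Return filtered text for a URL, a local file path, or raw HTML.

    `extract_text` is a hypothetical helper, not a BoilerPy3 function.
    """
    # Assumed rule: anything starting with an HTTP(S) scheme is a URL.
    if source.startswith(("http://", "https://")):
        return extractor.get_content_from_url(source)
    # A path that exists on disk is read as an HTML file.
    if os.path.isfile(source):
        return extractor.get_content_from_file(source)
    # Everything else is treated as a raw HTML string.
    return extractor.get_content(source)
```

With BoilerPy3 installed, passing extractors.ArticleExtractor() as the first argument and 'tests/test.html' as the second would route to get_content_from_file(), provided that file exists.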
Alternatively, use get_doc() to return a Boilerpipe document from which you can get more detailed information.
from boilerpy3 import extractors
extractor = extractors.ArticleExtractor()
doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title
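As a rough illustration of what can be done with the returned document, the sketch below condenses it into a small dictionary. `summarize_doc` and its `preview_chars` parameter are hypothetical names introduced here; the function relies only on the `content` and `title` attributes used above and assumes `content` is a string (or None).

```python
def summarize_doc(doc, preview_chars=80):
    """Build a small summary dict from a document's title and content.

    `summarize_doc` is a hypothetical helper, not a BoilerPy3 function.
    """
    # Guard against a missing body so the word count does not fail.
    content = doc.content or ""
    return {
        "title": doc.title,
        # Whitespace-separated tokens as a simple word count.
        "word_count": len(content.split()),
        # First few characters of the extracted text.
        "preview": content[:preview_chars],
    }
```

Any object exposing `title` and `content` works here, so the document returned by get_doc_from_url() in the example above could be passed in directly.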
DefaultExtractor: Usually worse than ArticleExtractor, but simpler/no heuristics. A quite generic full-text extractor.
ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.
LargestContentExtractor: A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor but usually worse than ArticleExtractor.
CanolaExtractor: A full-text extractor trained on krdwrd Canola. Works well with SimpleEstimator, too.
KeepEverythingExtractor: Dummy extractor which marks everything as content. Should return the input text. Use this to double-check whether your problem lies within a particular Extractor or somewhere else.
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
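Since the extractors trade off differently, one way to choose between them is to run several on the same page and compare how much text each keeps. The sketch below is an assumption-laden illustration rather than part of BoilerPy3: `compare_extractors` is a hypothetical helper, and word count is only a crude proxy for extraction quality. It calls nothing but get_content() on each extractor it receives.

```python
def compare_extractors(html, extractors_by_name):
    """Run each extractor on the same HTML and report how many words it kept.

    `compare_extractors` is a hypothetical helper, not a BoilerPy3 function.
    `extractors_by_name` maps a label to any object with get_content(html).
    """
    report = {}
    for name, extractor in extractors_by_name.items():
        text = extractor.get_content(html)
        # Word count is a rough size measure, not an accuracy score.
        report[name] = len(text.split())
    return report
```

Including the dummy extractor that marks everything as content (KeepEverythingExtractor in BoilerPy3) gives a baseline: if another extractor's count is far below it, the filtering is likely the cause of missing text rather than the input HTML.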