A tool for evaluating extractions from webpages. Evaluate article, price, and language extractions from HTML.
- a simple and slick CLI built with Typer
- run & evaluate only the extractors you want
- reduced boilerplate: extractors are autoloaded
  - add new extractors with just a few lines of code
  - easily write custom extractors to benchmark
- benchmark extraction in parallel (with Dask) or sequentially (you may be surprised how the throughput changes!)
- support for multiple methods of evaluating extractions
- support for multiple types of extractions
- Coming soon: Language benchmarks
- Coming soon: Title benchmarks
- Coming soon: Product benchmarks
- Coming soon: Support for custom datasets
- Requirements:
  - pip 22.3
  - python >= 3.8
- clone the repo: `git clone https://github.com/Nootka-io/wee-benchmarking-tool.git`
- `cd` into the directory you cloned the repo to
- create a venv: `python3 -m venv venv`
- activate the venv: `source venv/bin/activate`
- install in editable mode: `pip install -e .`
Wow! Here are my initial impressions after writing this library to investigate some issues I experienced when running extractions in parallel: some extremely fast libraries slow down, and one of the slower ones sees the biggest improvement.
Sequential Results
Similarity Threshold Results - an extraction is classified as successful if its similarity to the ground truth is greater than 90% (a rough sketch of this check follows the table).
Library | Accuracy | Precision | Recall | FScore | Mean Similarity | Items/sec |
---|---|---|---|---|---|---|
boilerpy3 | 0.4033 | 0.4033 | 0.4033 | 0.4033 | 0.7506 | 57.5429 |
goose3 | 0.6796 | 0.6796 | 0.6796 | 0.6796 | 0.8344 | 9.8552 |
inscriptis | 0.0331 | 0.0331 | 0.0331 | 0.0331 | 0.5092 | 74.6064 |
news-please | 0.558 | 0.558 | 0.558 | 0.558 | 0.812 | 4.8268 |
newspaper3k | 0.7845 | 0.7845 | 0.7845 | 0.7845 | 0.8855 | 7.6327 |
resiliparse-plain | 0.0884 | 0.0884 | 0.0884 | 0.0884 | 0.6054 | 776.8351 |
resiliparse | 0.6298 | 0.6298 | 0.6298 | 0.6298 | 0.8819 | 505.9411 |
trafilatura | 0.5304 | 0.5304 | 0.5304 | 0.5304 | 0.8446 | 36.6354 |
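For illustration only, here is a minimal sketch of the threshold check described above, using `difflib` as a stand-in similarity measure; the tool's actual similarity function may differ.

```python
from difflib import SequenceMatcher


def is_successful(prediction: str, ground_truth: str, threshold: float = 0.9) -> bool:
    """Classify an extraction as successful when its similarity to the ground
    truth exceeds the threshold. difflib is a stand-in similarity measure."""
    return SequenceMatcher(None, prediction, ground_truth).ratio() > threshold


# accuracy over a small batch of (prediction, ground_truth) pairs
pairs = [
    ("Full article text.", "Full article text."),
    ("nav footer boilerplate", "Full article text."),
]
accuracy = sum(is_successful(p, g) for p, g in pairs) / len(pairs)
print(accuracy)  # 0.5
```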
Complex Score Results - comparing the tokens of the ground truth and the prediction (a rough sketch of this comparison follows the table).
Library | Accuracy | Precision | Recall | FScore | Mean Similarity | Items/sec |
---|---|---|---|---|---|---|
boilerpy3 | 0.6381 | 0.8412 | 0.8743 | 0.8373 | 0.7506 | 57.5429 |
goose3 | 0.6276 | 0.9283 | 0.8561 | 0.8755 | 0.8344 | 9.8552 |
inscriptis | 0.6706 | 0.4561 | 0.9711 | 0.5869 | 0.5092 | 74.6064 |
news-please | 0.638 | 0.9105 | 0.8952 | 0.8861 | 0.8133 | 4.8268 |
newspaper3k | 0.6443 | 0.9281 | 0.9139 | 0.9041 | 0.8868 | 7.6327 |
resiliparse-plain | 0.6793 | 0.492 | 0.9965 | 0.6253 | 0.6054 | 776.8351 |
resiliparse | 0.6754 | 0.8529 | 0.9852 | 0.904 | 0.8819 | 505.9411 |
trafilatura | 0.6602 | 0.8975 | 0.9485 | 0.9113 | 0.8446 | 36.6354 |
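As a rough illustration of a token-based comparison, the sketch below computes token-level precision, recall, and F-score from overlapping tokens. Whitespace tokenisation and the exact formulas here are assumptions, not necessarily what the tool computes.

```python
from collections import Counter


def token_scores(prediction: str, ground_truth: str) -> dict:
    """Token-level precision, recall, and F-score based on overlapping tokens.
    Whitespace tokenisation is an assumption; the tool may tokenise differently."""
    pred = Counter(prediction.split())
    truth = Counter(ground_truth.split())
    overlap = sum((pred & truth).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(truth.values()), 1)
    fscore = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}


print(token_scores("the quick brown fox", "the quick brown fox jumps"))
# {'precision': 1.0, 'recall': 0.8, 'fscore': 0.888...}
```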
Parallel Results
Metrics are the same as above; only the timings change.
Library | Items/sec - dask bag | Items/sec - multiprocessing pool |
---|---|---|
boilerpy3 | 58.7689 | 378.1217 |
goose3 | 29.6042 | 59.5797 |
inscriptis | 59.1145 | 421.9863 |
news-please | 15.1085 | 24.9134 |
newspaper3k | 14.3891 | 23.7028 |
resiliparse-plain | 64.1965 | 804.3277 |
resiliparse | 63.8425 | 801.3418 |
trafilatura | 43.1901 | 263.4183 |
Notes:
- the items/sec metric will vary depending on available cores, memory, and more.
- timings are wall-clock times and purposely include framework overhead and serialization time. Different methods of distributing the work in parallel can produce different results, e.g. chunking the data or loading from a shared memory space. This isn't meant to profile the inner workings of the frameworks; for that, running a library like Scalene, an excellent tool for profiling Python apps, is recommended.
- Dask bag adds significant overhead. It's not clear to me at this time whether this comes from the `concurrent.futures.ProcessPoolExecutor` used under the hood or from Dask directly. Sometimes this can even make parallel runs slower than sequential ones. My recommendation is to run in parallel with Python's multiprocessing pool and break the work into chunks, or use a message queue for distributed computing (see the sketch after these notes).
- Multiprocessing pool is always faster than Dask in these benchmarks, but it may have memory issues that are not apparent here. Those are outside the scope of this article; take a look at [this blog post](https://luis-sena.medium.com/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047) for a better understanding, along with the [official docs](https://docs.python.org/3/library/multiprocessing.shared_memory.html).
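As a rough sketch of the two approaches compared above (the function names, chunk size, and placeholder `extract()` are illustrative only, not the tool's internals; `dask` must be installed for the bag variant):

```python
import multiprocessing as mp
from typing import List

import dask.bag as db


def extract(html: str) -> str:
    """Placeholder for one library's extraction call on a single HTML document."""
    return html.strip()


def run_with_pool(pages: List[str], chunksize: int = 50) -> List[str]:
    # multiprocessing.Pool with explicit chunking, as recommended above
    with mp.Pool() as pool:
        return pool.map(extract, pages, chunksize=chunksize)


def run_with_dask_bag(pages: List[str]) -> List[str]:
    # dask.bag distributes the same work but adds its own scheduling overhead
    return db.from_sequence(pages, npartitions=mp.cpu_count()).map(extract).compute()


if __name__ == "__main__":
    docs = ["<html><body>example</body></html>"] * 1000
    print(len(run_with_pool(docs)), len(run_with_dask_bag(docs)))
```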
ToDo: Provide more insight into the metrics and how they are calculated.
Quickstart:

```bash
wee-cli run
```

- this will run extractions and evaluations sequentially and output the results to the terminal.

See DOCS.md for the complete Typer CLI documentation, or use `wee-cli --help`.
To add a new extractor, only one file needs to be added to `wee_cli/extractors/`. The file name should be `run_[THE_NAME_OF_THE_EXTRACTOR].py`. Make sure to change the `name` parameter at the start of the class, extend the `BaseExtractor` class, and implement the `extract()` method.

See ../wee_cli/extractors/ for examples, and the sketch below.
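For example, a new extractor might look roughly like this. The import path and the exact `extract()` signature are assumptions based on the description above; mirror one of the existing extractors in `wee_cli/extractors/` for the real interface.

```python
# wee_cli/extractors/run_myextractor.py
# File name pattern: run_[THE_NAME_OF_THE_EXTRACTOR].py
import re

from wee_cli.extractors import BaseExtractor  # import path is an assumption; copy it from an existing extractor


class MyExtractor(BaseExtractor):
    name = "myextractor"  # the name shown in the results tables

    def extract(self, html):
        # naive tag-stripping stand-in for a real extraction library call
        return re.sub(r"<[^>]+>", " ", html).strip()
```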
ToDo: add better documentation and examples
- structured data markup extraction benchmarks
- language extraction benchmarks
- product extraction benchmarks
- ability to run various extraction tests
- make it easier to add different metrics
- support different dataset formats, like Prodigy and Label Studio
- parallel evaluation of results
- provide longer, better real-world examples
- evaluate other metrics
- write tests
- package (on hold since newspaper is installed from git)
- support different dataset formats, like Prodigy and Label Studio
- run evaluations in parallel
- store evaluation results (likely in sqlite)
- how to support anything available in schema.org markup, and the throughput of any schema extractor
- export tables as MD
- get rid of goose3's terrible logging