Rust implementation of Fei Sun, Dandan Song and Lejian Liao paper:
Content Extraction via Text Density (CETD)
use dom_content_extraction::{DensityTree, get_node_text};
let dtree = DensityTree::from_document(&document); // &scraper::Html
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;
println!("{}", get_node_text(node_id, &document));
dtree.calculate_density_sum();
let extracted_content = dtree.extract_content(&document);
println!("{}", extracted_content;
Add it it with:
cargo add dom-content-extraction
or add to you Cargo.toml
dom-content-extraction = "0.3"
Check examples.
This one will extract content from generated "lorem ipsum" page
cargo run --example check -- lorem-ipsum
There is scoring example i'm trying to implement scoring. You will need to download GoldenStandard and finalrun-input datasets from:
https://sigwac.org.uk/cleaneval/
and unpack archives into data/
directory.
cargo run --example ce_score
As far as i see there is problem opening some files:
Error processing file 730: Failed to read file: "data/finalrun-input/730.html"
Caused by:
stream did not contain valid UTF-8
But overall extraction works pretty well:
Overall Performance:
Files processed: 370
Average Precision: 0.87
Average Recall: 0.82
Average F1 Score: 0.75
- implement normal scoring
- create real world dataset
- improve algo