CC-News benchmark #600

Draft: wants to merge 3 commits into master

MaxDall
Collaborator

@MaxDall MaxDall commented Aug 30, 2024

This PR introduces functionality to benchmark publishers using the CC-NEWS dataset.

The benchmark retrieves HTML and articles from the CC-NEWS dataset at specified intervals (daily, weekly, monthly, etc.), assesses how complete the article extraction is, and provides utility and statistical functions for working with the results. The goal is to detect layout changes that occurred before a given parser was first implemented and to provide the relevant HTML so those changes can be addressed.
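
As a rough illustration of the completeness check described above, here is a minimal sketch. It is not the code in this PR: `Snapshot`, `EXPECTED_ATTRIBUTES`, `completeness`, and `find_incomplete_dates` are hypothetical names, and the attribute list is an assumption. It scores each sampled article by the fraction of expected attributes that were extracted and reports the crawl dates that fall below a threshold.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Sequence

# Assumption: the attributes a parser is expected to extract (hypothetical list).
EXPECTED_ATTRIBUTES = ("title", "body", "authors", "publishing_date")


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record: one sampled article and its extraction result."""
    crawl_date: date
    extracted: Dict[str, object]


def completeness(snapshot: Snapshot,
                 expected: Sequence[str] = EXPECTED_ATTRIBUTES) -> float:
    """Fraction of expected attributes extracted with a non-empty value."""
    found = sum(1 for name in expected if snapshot.extracted.get(name))
    return found / len(expected)


def find_incomplete_dates(snapshots: Sequence[Snapshot],
                          threshold: float = 1.0) -> List[date]:
    """Crawl dates whose completeness is below the threshold, oldest first."""
    return sorted(s.crawl_date for s in snapshots if completeness(s) < threshold)
```

A date returned by `find_incomplete_dates` marks a sample whose stored HTML is worth inspecting for a layout the parser does not yet handle.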

@MaxDall MaxDall marked this pull request as draft August 30, 2024 14:30