Notes #6

Open
bast opened this issue Apr 7, 2022 · 0 comments


bast commented Apr 7, 2022

We can split these into separate issues later; for now I just don't want them to get lost:

  • implement async fetching
  • test onion deduplication
  • encoding is not always UTF-8
  • keep track of paragraphs
  • save second-level domain
  • how to identify paywall
  • start with sitemap, collect everything, de-duplicate later
  • UiT UB Atekst
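
For the "encoding is not always UTF-8" item, here is a minimal sketch of one possible fallback strategy, not a decided design. The function name `decode_html`, the `declared` parameter (for example a charset taken from an HTTP header), and the fallback order are all assumptions made for illustration:

```python
from __future__ import annotations


def decode_html(raw: bytes, declared: str | None = None) -> tuple[str, str]:
    """Decode fetched bytes and report which encoding was used.

    Hypothetical helper: tries the declared charset first (if any),
    then UTF-8, then cp1252, and finally latin-1, which accepts any
    byte sequence and therefore never fails.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    candidates += ["utf-8", "cp1252"]

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: fall through to the next candidate.
            continue

    # Last resort: latin-1 maps every byte to a code point.
    return raw.decode("latin-1"), "latin-1"
```

Returning the encoding alongside the text would let us record how often the fallback is actually hit before settling on a proper detection approach.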