This repository has been archived by the owner on Jun 22, 2020. It is now read-only.
Between the Clinton emails and the Podesta leak, it seems to me that many document sets include a ton of copy-pasted news articles. By themselves, these are really boring and can obscure more interesting stuff. It'd be neat to classify/rank documents by whether they're mostly boilerplate (signatures, disclaimers) or news articles, and therefore boring.
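To make the idea concrete, here is a rough sketch of one way such a ranking could work: score each document by the fraction of its lines that look like boilerplate, then sort by that score. Everything in it is an assumption for illustration. The pattern list, `boring_score`, and `rank_by_boringness` are hypothetical names and not part of this repository, and a plain regex heuristic stands in for whatever classifier would actually be used.

```python
import re

# Hypothetical patterns, chosen only for illustration; a real document set
# would need its own tuned list (or a trained classifier instead).
BOILERPLATE_PATTERNS = [
    re.compile(r"confidential|privileged|intended recipient", re.I),  # legal disclaimers
    re.compile(r"^sent from my \w+", re.I),                           # mobile signatures
    re.compile(r"^(by|source):\s", re.I),                             # news bylines
    re.compile(r"\b(reuters|associated press)\b", re.I),              # wire-service credits
]


def boring_score(text: str) -> float:
    """Return the fraction of non-empty lines that match a boilerplate pattern."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if any(p.search(ln) for p in BOILERPLATE_PATTERNS))
    return hits / len(lines)


def rank_by_boringness(docs: dict[str, str]) -> list[tuple[str, float]]:
    """Sort document ids from most to least boilerplate-heavy."""
    scored = [(doc_id, boring_score(body)) for doc_id, body in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_boringness` on a mapping of document ids to text returns the ids ordered with the most boilerplate-heavy first, so the boring ones could be pushed down or filtered out. A line-level heuristic like this would likely miss pasted articles that carry no byline or wire credit; detecting those probably needs near-duplicate matching across documents.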