Character encoding issues in boilerplate processing #29
The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.

Comments
After downloading and looking at the original data set, it turns out that the character set decoding is being done wrong on the input side. The output looks like it is correctly writing UTF-8, but the characters are already corrupted by then. In the particular case of 105.html, the source encoding is …
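For illustration, here is a minimal sketch of that failure mode: the bytes are decoded with the wrong charset on input, and the already-damaged String is then written out as perfectly valid UTF-8, so the output side looks fine. The windows-1252 source encoding below is only an assumption for the demo, not the actual encoding of 105.html.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // A page containing "Epogen®" stored on disk as windows-1252 (assumed).
        String original = "Epogen\u00ae";
        byte[] onDisk = original.getBytes(Charset.forName("windows-1252"));

        // Wrong: decode the windows-1252 bytes as UTF-8. The 0xAE byte is not
        // valid UTF-8, so it is replaced with U+FFFD; the String is corrupted
        // at this point, before any output happens.
        String decoded = new String(onDisk, StandardCharsets.UTF_8);

        // Re-encoding the corrupted String as UTF-8 "succeeds", which is why
        // the writer appears to emit correct UTF-8 of already-broken text.
        System.out.println(decoded); // prints "Epogen�"
    }
}
```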
And to follow up on my last comment, this only affects the standalone program, not the Hadoop processing. Rather than letting JSoup do the character set determination, I decided to keep the API the same and use, in the standalone program, the same character encoding detection that the Hadoop processing does (sketched below). I've got a PR with fixes for all the boilerplate problems that I've seen (and some related stuff).
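A rough sketch of that approach follows. It assumes ICU4J's CharsetDetector for the detection step; the detector the Hadoop code actually uses may differ, and parseWithDetectedCharset is a hypothetical helper, not the PR's API. The point is to detect the charset from the raw bytes, decode exactly once, and only then hand the String to JSoup.

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DetectThenParse {
    public static Document parseWithDetectedCharset(String path) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(path));

        // Detect the charset from the raw bytes instead of letting JSoup guess.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();
        String charset = (match != null) ? match.getName() : "UTF-8";

        // Decode once with the detected charset; writing the result as UTF-8
        // downstream is then safe because the String itself is correct.
        String html = new String(raw, Charset.forName(charset));
        return Jsoup.parse(html);
    }

    public static void main(String[] args) throws IOException {
        Document doc = parseWithDetectedCharset("105.html");
        System.out.println(doc.title());
    }
}
```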