Character encoding issues in boilerplate processing #29

tfmorris · 2016-04-04T14:17:59Z

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.

tfmorris · 2016-04-08T18:16:22Z

After downloading and looking at the original data set, it turns out that the character set decoding is being done wrong on the input side. The output side appears to write UTF-8 correctly, but the characters are already corrupted by then.

In the particular case of 105.html the source encoding is windows-1252. Rather than using the file utilities to read the file into a string, the HTML parser should be allowed to parse the byte stream directly and use the encoding that it finds there. The current scheme will corrupt all non-UTF-8 documents.
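To illustrate the failure mode, here is a minimal sketch using only the JDK. The class name `EncodingMismatchDemo` and the sample bytes are made up for illustration and are not taken from the project code. It decodes the same windows-1252 bytes once as UTF-8 and once with the correct charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of dkpro-c4corpus.
public class EncodingMismatchDemo {

    // Decode raw bytes with an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Epogen" followed by the registered sign, which windows-1252
        // stores as the single byte 0xAE.
        byte[] raw = {'E', 'p', 'o', 'g', 'e', 'n', (byte) 0xAE};

        // A lone 0xAE is not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD and the original is lost.
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Decoding with the real source charset preserves the character.
        String right = decode(raw, Charset.forName("windows-1252"));

        System.out.println("as UTF-8 ends with U+FFFD: " + wrong.endsWith("\uFFFD"));
        System.out.println("as windows-1252 is intact: " + right.equals("Epogen\u00AE"));
    }
}
```

Once the string holds U+FFFD, writing it out as UTF-8 cannot bring the original character back, which matches the observation that the output side is fine and the damage happens on input.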

tfmorris · 2016-04-09T17:32:56Z

And to follow up on my last comment, this only affects the standalone program, not the Hadoop processing. Rather than letting JSoup do the character set determination, I decided to keep the API the same and have the standalone program use the same character encoding detection that the Hadoop processing does.
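The general shape of "detect the charset from the bytes, then decode" can be sketched as follows. This is only a stand-in: the thread does not say which detector the Hadoop code uses, so the strict-UTF-8-with-fallback heuristic, the class name `DetectThenDecode`, and the choice of windows-1252 as the fallback are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the detector used by dkpro-c4corpus.
public class DetectThenDecode {

    // Assumed fallback for input that is not valid UTF-8.
    public static final Charset FALLBACK = Charset.forName("windows-1252");

    // Return UTF-8 if the bytes decode strictly as UTF-8, else the fallback.
    public static Charset detect(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return FALLBACK;
        }
    }

    // Keep a String-returning API: callers still get text, but the
    // charset is chosen from the bytes instead of being assumed.
    public static String readHtml(byte[] raw) {
        return new String(raw, detect(raw));
    }
}
```

A real detector would typically also look at declared encodings and byte statistics; the point here is only that detection runs on the raw bytes before any `String` is created.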

I've got a PR with fixes for all the boilerplate problems that I've seen (and some related stuff).

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

af105ec

tfmorris mentioned this issue Apr 9, 2016

Fix O(n!) in tag depth issue #28

Open

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

9995429

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 10, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

6582fc8

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 13, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

fbb3198

habernal added this to the 1.0.1 milestone Apr 14, 2016

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

c08bb16

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

6c6edb5

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 28, 2016

Handle non-UTF-8 input files. Fixes dkpro#29.

50f1573

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Jun 12, 2020

Handle non-UTF-8 input files. Fixes dkpro#29.

e3c348b
