Failure with some UTF8 control characters accepted by python #5

ivanprado · 2018-11-14T10:35:11Z

Some characters make the cld2 detector fail even though they are valid UTF-8. An example from mikemccand/chromium-compact-language-detector#22 (comment):

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
test.encode('utf8')
Out[23]: b'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
cld2.detect(test.encode())
Traceback (most recent call last):
  File "/home/ivan/Documentos/scrapinghub/dev/web-rcnn-venv/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-68905466763d>", line 1, in <module>
    cld2.detect(test.encode())
  File "/home/ivan/Documentos/scrapinghub/dev/web-rcnn-venv/lib/python3.6/site-packages/cld2/__init__.py", line 396, in detect
    cld_results.bytes_found))
ValueError: input contains invalid UTF-8 around byte 158 (of -1117539408)

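I'm using the following workaround, as suggested in mikemccand/chromium-compact-language-detector#22 (comment). It drops every character whose Unicode general category starts with S (symbol), M (mark) or C (control/other):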

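import unicodedata
# Keep only characters that are not symbols (S), marks (M) or control/other (C).
html = ''.join([l for l in html if
                        unicodedata.category(l)[0] not in ('S', 'M', 'C')])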
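
For anyone who wants to reuse this, here is a minimal sketch of the same filter wrapped in a function, plus a narrower variant. The function names `strip_categories` and `strip_control_chars` are made up for illustration and are not part of cld2. The narrower variant rests on an assumption I have not verified against cld2: that only control characters such as `\x03` trigger the error. The original filter also removes newlines, tabs and symbols like `£`, which may be more than you want.

```python
import unicodedata


def strip_categories(text, major_categories=("S", "M", "C")):
    """Drop every character whose Unicode major category is listed.

    With the defaults this matches the workaround above: symbols,
    marks and control/other characters are removed.
    """
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in major_categories
    )


def strip_control_chars(text, keep="\n\r\t"):
    """Hypothetical narrower variant: drop only 'Cc' control characters.

    Characters listed in `keep` survive even though they are 'Cc'.
    Whether this is enough to avoid the cld2 error is an assumption.
    """
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


sample = "filler.\x03\n\t almost there \xa325 more\xa0filler"
print(repr(strip_categories(sample)))
print(repr(strip_control_chars(sample)))
```

The cleaned string would then be encoded and passed on as before, for example `cld2.detect(strip_categories(test).encode('utf8'))`.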
