Failure with some UTF8 control characters accepted by python #5

Open
ivanprado opened this issue Nov 14, 2018 · 0 comments

Some characters make the cld2 detector fail even though they are valid UTF-8 and Python encodes them without complaint. An example from mikemccand/chromium-compact-language-detector#22 (comment):

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
test.encode('utf8')
Out[23]: b'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
cld2.detect(test.encode())
Traceback (most recent call last):
  File "/home/ivan/Documentos/scrapinghub/dev/web-rcnn-venv/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-68905466763d>", line 1, in <module>
    cld2.detect(test.encode())
  File "/home/ivan/Documentos/scrapinghub/dev/web-rcnn-venv/lib/python3.6/site-packages/cld2/__init__.py", line 396, in detect
    cld_results.bytes_found))
ValueError: input contains invalid UTF-8 around byte 158 (of -1117539408)
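
The error message gives a byte offset into the encoded input, not a character index into the original string. A minimal sketch for looking at the bytes near that offset follows; `bytes_around` and its `width` parameter are hypothetical names introduced here, not part of cld2.

```python
def bytes_around(data: bytes, offset: int, width: int = 8) -> bytes:
    """Return the slice of `data` surrounding `offset` (hypothetical helper)."""
    # Clamp the start so a small offset does not wrap around to the end.
    start = max(0, offset - width)
    return data[start:offset + width]
```

Calling `bytes_around(test.encode('utf8'), 158)` with the offset from the traceback should show which bytes the detector was looking at when it gave up.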

I'm using the following workaround, as suggested in mikemccand/chromium-compact-language-detector#22 (comment):

import unicodedata
# Drop symbol (S), mark (M) and control/other (C) characters before detection.
html = ''.join(ch for ch in html
               if unicodedata.category(ch)[0] not in ('S', 'M', 'C'))
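
The workaround above also removes legitimate symbols and combining marks, for example the `£` sign in the test string. A narrower variant is sketched below, assuming that the control character (`\x03` in the example) is what trips the detector. That assumption has not been verified against cld2, and `strip_control_chars` is a hypothetical helper name.

```python
import unicodedata

# Whitespace controls that are worth keeping in ordinary text.
KEEP = {"\n", "\r", "\t"}

def strip_control_chars(text: str) -> str:
    """Remove Unicode control characters (category Cc) except common whitespace."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )
```

The cleaned string would then be passed on as before, e.g. `cld2.detect(strip_control_chars(test).encode('utf8'))`. If the detector still fails on other inputs, the broader category filter remains the fallback.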