Unable to handle utf-8 characters that python can handle? #22

Open
DataNeel opened this issue Aug 28, 2015 · 12 comments

@DataNeel

I'm trying to use cld2 on some scraped web data, and I am running into some encoding issues. The text is scraped with Beautiful Soup into a unicode string, with the source encoding specified to Beautiful Soup as utf-8. The HTML of the document declared that it was in utf-8. Below, I have included an example of one of the strings, anonymized with some filler text.

When I try to encode or decode this text, python does not have any issues. When I try to run it through cld2, however, I get errors.

>>> test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> test.encode('utf8')
'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> test.encode('utf8').decode('utf8')
u'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> cld2.detect(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 52: ordinal not in range(128)
>>> cld2.detect(test.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 158 (of 278)
>>> test.encode('utf8')[158:168]
'\x03\n\t\t\t\t\t\t  '

Am I not using this correctly? The characters appear to be legitimate, but cld2 is giving me a hard time.

@smithsimonj

Hi,

I agree with DataNeel; I've seen the code blow up on some of the Unicode control characters (https://en.wikipedia.org/wiki/C0_and_C1_control_codes).

E.g. the sample text: "is\u0085 Able", which is made up of the following characters:

0  i        105  Ll
1  s        115  Ll
2  \u0085   133  Cc
3  (space)   32  Zs
4  A         65  Lu
5  b         98  Ll
6  l        108  Ll
7  e        101  Ll

The second number is the ordinal value of the char and the final column is the unicode category as given by unicodedata.category(char).

Where the funky character after "is" is the Python char u"\u0085" (http://www.fileformat.info/info/unicode/char/85/index.htm), a valid Unicode character that encodes to valid UTF-8.

Passing this to the latest version of the language detector yields the error:

error: input contains invalid UTF-8 around byte 4 (of 9)
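
For reference, a minimal sketch that reproduces this, assuming the cld2 module from this repo:

import cld2

text = u"is\u0085 Able"
# passing the utf-8 encoded bytes, as in the original report
cld2.detect(text.encode('utf8'))
# raises: cld2.error: input contains invalid UTF-8 around byte 4 (of 9)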

@carlosdubus

I'm also experiencing this. Any workarounds?

@matheusportela

This is also happening to me. Has any progress been made?

@ZacharyST

Ditto - any updates?

@motazsaad

Hello,
Any news?

@ZacharyST

ZacharyST commented Dec 4, 2016 via email

@lcalem

lcalem commented Dec 5, 2016

My workaround for this error: when this happens I just clean my HTML with something like:

import string
printable_str = ''.join(x for x in html_str if x in string.printable)

then re-run detect on this.

It's fine for me since it only happens rarely.

@ZacharyST

ZacharyST commented Dec 5, 2016 via email

@ales-t

ales-t commented Jan 5, 2018

@lcalem Just a note: string.printable only contains ASCII printable characters. When dealing with multiple languages, that can be a major limitation (e.g. it will remove all Chinese characters from a string in Chinese).

In Python 3, it's possible to use the isprintable() string method like this:

printable_str = ''.join(x for x in html_str if x in x.isprintable())

@andreoua

Just a minor correction: the `in` is not needed in the if statement. It should be:

printable_str = ''.join(x for x in html_str if x.isprintable())
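
Note that str.isprintable() is False for whitespace such as "\n" and "\t", so this also strips line breaks. A variant that keeps whitespace would be:

printable_str = ''.join(x for x in html_str if x.isprintable() or x.isspace())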

@gilko1981

A better workaround is:

import unicodedata
text = ''.join([l for l in text if unicodedata.category(unicode(l))[0] not in ('S', 'M', 'C')])

omitting only the undesired Unicode characters; see
http://www.fileformat.info/info/unicode/category/index.htm
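
For Python 3, where the unicode() builtin no longer exists, a sketch of the same idea:

import unicodedata

# keep everything except Symbol, Mark, and Control category characters
text = ''.join(l for l in text if unicodedata.category(l)[0] not in ('S', 'M', 'C'))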

sjlongland added a commit to sjlongland/hackaday.io-spambot-hunter that referenced this issue Dec 7, 2018
`pycld` is fussy when it comes to UTF-8 (see
mikemccand/chromium-compact-language-detector#22
and aboSamoor/polyglot#71).  This strips out
the characters that make `cld` choke.

Thanks to @andreoua for the suggested fix.
@ddelange

It's actually only the Cc and Cs unicode categories that throw this error as far as I can tell. Using regex to remove them as suggested here should do the trick.

import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'
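
Combining this with the detector should then work; a sketch, assuming the cld2 module from this repo:

import cld2

isReliable, textBytesFound, details = cld2.detect(remove_bad_chars("A\x96 bad char"))
# no cld2.error once the Cc/Cs characters are stripped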

I brute-forced each unicode character through polyglot on py38 (ref aboSamoor/polyglot#71 (comment)):

Brute-force script
import sys
import unicodedata
from collections import defaultdict

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format  161
    "Co",  # Private Use 0
    "Cs",  # Surrrogate  0
    "Ll",  # Lowercase Letter    2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter    121,414
    "Lt",  # Titlecase Letter    31
    "Lu",  # Uppercase Letter    1,788
    "Mc",  # Spacing Mark    429
    "Me",  # Enclosing Mark  13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number  630
    "Nl",  # Letter Number   236
    "No",  # Other Number    888
    "Pc",  # Connector Punctuation   10
    "Pd",  # Dash Punctuation    24
    "Pe",  # Close Punctuation   73
    "Pf",  # Final Punctuation   10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation   588
    "Ps",  # Open Punctuation    75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol    6,160
    "Zl",  # Line Separator  1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]

from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except Exception:  # record any failure for this category
            error_cats.add(cat)

# all categories that errored
print(error_cats)
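# on py38 this reportedly prints {'Cc', 'Cs'}, matching the note above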
