Unable to handle utf-8 characters that python can handle? #22

Open
DataNeel opened this issue Aug 28, 2015 · 12 comments

@DataNeel

I'm trying to use cld2 on some scraped web data, and I am running into some encoding issues. The text is scraped with Beautiful Soup into a unicode string, with the source encoding specified to Beautiful Soup as utf-8. The HTML of the document declared that it was in utf-8. Below, I have included an example of one of the strings, anonymized with some filler text.

When I try to encode or decode this text, python does not have any issues. When I try to run it through cld2, however, I get errors.

>>> test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> test.encode('utf8')
'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> test.encode('utf8').decode('utf8')
u'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> cld2.detect(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 52: ordinal not in range(128)
>>> cld2.detect(test.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 158 (of 278)
>>> test.encode('utf8')[158:168]
'\x03\n\t\t\t\t\t\t  '

Am I not using this correctly? The characters appear to be legitimate, but cld2 is giving me a hard time.

@smithsimonj

Hi,

I agree with DataNeel; I've seen the code blow up on some of the Unicode control characters (https://en.wikipedia.org/wiki/C0_and_C1_control_codes).

E.g. the sample text: "is\u0085 Able", which is made up of the following characters:

0  i        105  Ll
1  s        115  Ll
2  \u0085   133  Cc
3  (space)   32  Zs
4  A         65  Lu
5  b         98  Ll
6  l        108  Ll
7  e        101  Ll

The second number is the ordinal value of the char and the final column is the unicode category as given by unicodedata.category(char).

Where the funky character after "is" is the Python char u"\u0085" (http://www.fileformat.info/info/unicode/char/85/index.htm), a valid Unicode character that encodes to valid UTF-8.

Passing this to the latest version of the language detector yields the error:

error: input contains invalid UTF-8 around byte 4 (of 9)
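
For reference, a minimal sketch that reproduces this, assuming the cld2 module from this repo:

import cld2

text = u"is\u0085 Able"
# passing the utf-8 encoded bytes, as in the original report
cld2.detect(text.encode('utf8'))
# raises: cld2.error: input contains invalid UTF-8 around byte 4 (of 9)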

@carlosdubus

I'm also experiencing this. Any workarounds?

@matheusportela

This is also happening to me. Has any progress been made?

@ZacharyST

Ditto - any updates?

@motazsaad

Hello,
Any news?

@ZacharyST

ZacharyST commented Dec 4, 2016 via email

@lcalem

lcalem commented Dec 5, 2016

My workaround for this error: when this happens I just clean my HTML with something like:

import string
printable_str = ''.join(x for x in html_str if x in string.printable)

then re-run detect on this.

It's fine for me since it only happens rarely.

@ZacharyST

ZacharyST commented Dec 5, 2016 via email

@ales-t

ales-t commented Jan 5, 2018

@lcalem Just a note: string.printable only contains ASCII printable characters. When dealing with multiple languages, that can be a major limitation (e.g. it will remove all Chinese characters from a string in Chinese).

In Python 3, it's possible to use the isprintable() string method like this:

printable_str = ''.join(x for x in html_str if x in x.isprintable())

@andreoua

Just a minor correction: the `in` is not needed in the if statement. It should be:

printable_str = ''.join(x for x in html_str if x.isprintable())
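
Note that str.isprintable() is False for whitespace such as "\n" and "\t", so this also strips line breaks. A variant that keeps whitespace would be:

printable_str = ''.join(x for x in html_str if x.isprintable() or x.isspace())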

@gilko1981

A better workaround is:

import unicodedata
text = ''.join([l for l in text if unicodedata.category(unicode(l))[0] not in ('S', 'M', 'C')])

omitting only the undesired Unicode characters; see
http://www.fileformat.info/info/unicode/category/index.htm
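
For Python 3, where the unicode() builtin no longer exists, a sketch of the same idea:

import unicodedata

# keep everything except Symbol, Mark, and Control category characters
text = ''.join(l for l in text if unicodedata.category(l)[0] not in ('S', 'M', 'C'))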

sjlongland added a commit to sjlongland/hackaday.io-spambot-hunter that referenced this issue Dec 7, 2018
`pycld` is fussy when it comes to UTF-8 (see
mikemccand/chromium-compact-language-detector#22
and aboSamoor/polyglot#71).  This strips out
the characters that make `cld` choke.

Thanks to @andreoua for the suggested fix.
@ddelange

It's actually only the Cc and Cs unicode categories that throw this error as far as I can tell. Using regex to remove them as suggested here should do the trick.

import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'
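
Combining this with the detector should then work; a sketch, assuming the cld2 module from this repo:

import cld2

isReliable, textBytesFound, details = cld2.detect(remove_bad_chars("A\x96 bad char"))
# no cld2.error once the Cc/Cs characters are stripped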

I brute-forced each unicode character through polyglot on py38 (ref aboSamoor/polyglot#71 (comment)):

Brute-force script
import sys
import unicodedata
from collections import defaultdict

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format  161
    "Co",  # Private Use 0
    "Cs",  # Surrrogate  0
    "Ll",  # Lowercase Letter    2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter    121,414
    "Lt",  # Titlecase Letter    31
    "Lu",  # Uppercase Letter    1,788
    "Mc",  # Spacing Mark    429
    "Me",  # Enclosing Mark  13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number  630
    "Nl",  # Letter Number   236
    "No",  # Other Number    888
    "Pc",  # Connector Punctuation   10
    "Pd",  # Dash Punctuation    24
    "Pe",  # Close Punctuation   73
    "Pf",  # Final Punctuation   10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation   588
    "Ps",  # Open Punctuation    75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol    6,160
    "Zl",  # Line Separator  1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]

from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except Exception:  # record any failure for this category
            error_cats.add(cat)

# all categories that errored
print(error_cats)
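# on py38 this reportedly prints {'Cc', 'Cs'}, matching the note above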
