Unable to handle utf-8 characters that python can handle? #22
Hi, I agree with DataNeel, I've seen the code blow up with some of the UTF-8 control characters (https://en.wikipedia.org/wiki/C0_and_C1_control_codes). E.g. the sample text "is\u0085 Able", where the funky character after "is" is the Python char u"\u0085" (http://www.fileformat.info/info/unicode/char/85/index.htm), a valid UTF-8 character. Passing this to the latest version of the language detector yields an error.
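For reference, the sample character's ordinal value and Unicode category can be checked with the stdlib `unicodedata` module; a quick sketch:

```python
import unicodedata

ch = u"\u0085"  # NEL, the character after "is" in the sample text
print(ord(ch))                   # 133 -- its ordinal value
print(unicodedata.category(ch))  # 'Cc' -- a C1 control character
```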
I'm also experiencing this. Any workarounds?
This problem is also happening to me. Has any progress been made?
Ditto - any updates?
Hello, any news?
Nope. I just updated my code to ignore those errors – it was only .05% of my data :)
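A minimal sketch of that skip-on-error approach; `detect_or_none` and `picky_detect` are hypothetical names, with `picky_detect` standing in for the real cld2 call:

```python
def detect_or_none(detect, text):
    # Call the detector, swallowing any error it raises on
    # strings it cannot handle (e.g. stray control characters).
    try:
        return detect(text)
    except Exception:
        return None

# Stand-in detector that rejects the U+0085 control character:
def picky_detect(text):
    if u"\u0085" in text:
        raise ValueError("input contains invalid characters")
    return "en"

print(detect_or_none(picky_detect, u"is\u0085 Able"))  # -> None
print(detect_or_none(picky_detect, u"is Able"))        # -> en
```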
My workaround for this error: when this happens I just clean my html with something like:
printable_str = ''.join(x for x in html_str if x in string.printable)
then re-launch the detect on this. It's fine for me since it only happens rarely.
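One caveat worth noting (not from the thread): `string.printable` contains only ASCII characters, so this filter also strips every legitimate non-ASCII letter, which can matter when the goal is language detection. A small illustration with a hypothetical sample string:

```python
import string

# Hypothetical sample: an accented letter plus the U+0085 control char.
html_str = u"caf\u00e9 is\u0085 Able"
printable_str = ''.join(x for x in html_str if x in string.printable)
print(printable_str)  # -> 'caf is Able' -- the control char is gone, but so is the e-acute
```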
Thanks!
@lcalem Just a note: In Python 3, it's possible to use the
Just a minor correction.
A better workaround is omitting only the undesired UTF-8 chars.
`pycld` is fussy when it comes to UTF-8 (see mikemccand/chromium-compact-language-detector#22 and aboSamoor/polyglot#71). This strips out the characters that make `cld` choke. Thanks to @andreoua for the suggested fix.
It's actually only the `regex` import and a couple of lines:

```python
import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # \x96 is in the Cc category
# 'A bad char'
```

I brute-forced each Unicode character through `polyglot` to find the categories that error out:

```python
import sys
import unicodedata
from collections import defaultdict

from polyglot.text import Text

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control                    65
    "Cf",  # Format                    161
    "Co",  # Private Use                 0
    "Cs",  # Surrogate                   0
    "Ll",  # Lowercase Letter        2,151
    "Lm",  # Modifier Letter           259
    "Lo",  # Other Letter          121,414
    "Lt",  # Titlecase Letter           31
    "Lu",  # Uppercase Letter        1,788
    "Mc",  # Spacing Mark              429
    "Me",  # Enclosing Mark             13
    "Mn",  # Nonspacing Mark         1,826
    "Nd",  # Decimal Number            630
    "Nl",  # Letter Number             236
    "No",  # Other Number              888
    "Pc",  # Connector Punctuation      10
    "Pd",  # Dash Punctuation           24
    "Pe",  # Close Punctuation          73
    "Pf",  # Final Punctuation          10
    "Pi",  # Initial Punctuation        12
    "Po",  # Other Punctuation         588
    "Ps",  # Open Punctuation           75
    "Sc",  # Currency Symbol            62
    "Sk",  # Modifier Symbol           121
    "Sm",  # Math Symbol               948
    "So",  # Other Symbol            6,160
    "Zl",  # Line Separator              1
    "Zp",  # Paragraph Separator         1
    "Zs",  # Space Separator            17
]

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except Exception:
            error_cats.add(cat)

# all categories that errored
print(error_cats)
```
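If pulling in the third-party `regex` package is undesirable (the stdlib `re` module does not support `\p{...}` category classes), an equivalent filter can be sketched with `unicodedata`; `remove_bad_chars_stdlib` is an illustrative name:

```python
import unicodedata

def remove_bad_chars_stdlib(text):
    # Drop characters in the Cc (control) and Cs (surrogate)
    # categories, mirroring the regex-based remove_bad_chars above.
    return "".join(ch for ch in text
                   if unicodedata.category(ch) not in ("Cc", "Cs"))

print(remove_bad_chars_stdlib("A\x96 bad char"))  # -> A bad char
```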
I'm trying to use cld2 on some scraped web data, and I am running into some encoding issues. The text is scraped with Beautiful Soup into unicode, and the from-format is specified to Beautiful Soup as utf-8. The HTML of the document declared that it was in utf-8. Below, I have included an example of one of the strings that I anonymized with some filler text.
When I try to encode or decode this text, Python does not have any issues. When I try to run it through cld2, however, I get errors.
Am I not using this correctly? The characters appear to be legitimate, but cld2 is giving me a hard time.
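The encode/decode check described here can be reproduced with a round trip through UTF-8, using the `\u0085` sample from the comments above as a stand-in for the scraped string:

```python
# U+0085 (NEL) is a valid Unicode character: it round-trips
# through UTF-8 in pure Python without raising any error.
text = u"is\u0085 Able"
assert text.encode("utf-8").decode("utf-8") == text
print("round trip ok")
```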