Unicode differences between re2 and re? #5

Open
turian opened this issue Apr 29, 2011 · 3 comments

turian commented Apr 29, 2011

I am seeing differences between re2 and re when re.UNICODE is used.

I am not able to get re2 to detect Unicode alphabetic characters, even when I encode to UTF-8.

Here is an example:

In [24]: print u'\xe8'.encode("utf-8")
è

In [25]: re.compile('[^\W]', re.UNICODE).search(u'\xe8')
Out[25]: <_sre.SRE_Match object at 0x1186850>

In [26]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8')

In [27]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8'.encode("utf-8"))
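For reference, the standard-library behaviour that the report expects can be reproduced in isolation. This is a minimal sketch in Python 3 syntax (the session above is Python 2), where str patterns are Unicode-aware by default; re.ASCII is used only to show the ASCII-only result, which resembles the missing match described above. It makes no claim about how re2 is implemented.

```python
import re

ch = "\u00e8"  # è, LATIN SMALL LETTER E WITH GRAVE

# Unicode-aware matching. This is the default for str patterns in Python 3;
# re.UNICODE is spelled out only to mirror the report.
unicode_match = re.compile(r"[^\W]", re.UNICODE).search(ch)

# ASCII-only matching: \w covers just [a-zA-Z0-9_], so è counts as \W
# and the negated class [^\W] finds nothing.
ascii_match = re.compile(r"[^\W]", re.ASCII).search(ch)

print(unicode_match is not None)  # True
print(ascii_match is None)        # True
```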

itsadok commented May 5, 2011

This is a glaring omission in prepare_pattern: we only handle \d, \w and \s, but not the corresponding \D, \W and \S. I'll try to find some time to fix it.
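One way such a pattern rewrite could look is sketched below. This is not the project's actual prepare_pattern: the function name expand_unicode_escapes and both lookup tables are hypothetical, and the replacements are approximations that assume the target engine accepts \p{...} Unicode property classes (RE2 does, the standard-library re module does not). The sketch only does string rewriting and handles bracket classes naively.

```python
# Hypothetical replacements for escapes that appear outside a bracket class.
OUTSIDE_CLASS = {
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    "w": r"[_\p{L}\p{N}]",
    "W": r"[^_\p{L}\p{N}]",
    "s": r"[\s\p{Z}]",
    "S": r"[^\s\p{Z}]",
}

# Inside a bracket class only escapes that can be spliced in as a union are
# rewritten. \W and \S are left untouched here, so a pattern such as [^\W]
# keeps its ASCII meaning and would need extra handling.
INSIDE_CLASS = {
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    "w": r"_\p{L}\p{N}",
    "s": r"\s\p{Z}",
}


def expand_unicode_escapes(pattern):
    """Rewrite Perl-style class escapes into Unicode property classes."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            table = INSIDE_CLASS if in_class else OUTSIDE_CLASS
            # Unknown escapes (including an escaped backslash) pass through.
            out.append(table.get(nxt, ch + nxt))
            i += 2
            continue
        # Naive class tracking: a literal "]" placed first in a class
        # (as in "[]a]") is not recognised by this sketch.
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(expand_unicode_escapes(r"\W+"))
print(expand_unicode_escapes(r"[\d_]"))
```

Handling the upper-case escapes through the same tables as the lower-case ones keeps the two cases from drifting apart, which is the kind of omission described above.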

turian commented Nov 8, 2011

Please.

axiak pushed a commit that referenced this issue Nov 8, 2011

axiak commented Nov 8, 2011

We had an issue with \W, \D and \S that itsadok just fixed, and I pushed the fix out. However, I think there are still Unicode issues: the groups in issue #4 don't quite match up (I added that case as a test). Please pull the latest version and see if it works for you while I look into why the test is failing.
