Unicode differences between re2 and re? #5

turian · 2011-04-29T22:29:16Z

I am seeing differences between re2 and re when re.UNICODE is used.

I am not able to get re2 to detect Unicode alphabetic characters, even when I encode to UTF-8.

Here is an example:

In [24]: print u'\xe8'.encode("utf-8")
è

In [25]: re.compile('[^\W]', re.UNICODE).search(u'\xe8')
Out[25]: <_sre.SRE_Match object at 0x1186850>

In [26]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8')

In [27]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8'.encode("utf-8"))

itsadok · 2011-05-05T09:11:31Z

This is a glaring omission in prepare_pattern: we only handle \d, \w and \s, but not the corresponding \D, \W and \S. I'll try to find some time to fix it.
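To illustrate the kind of rewriting involved, here is a minimal sketch of a pattern pre-processor that expands both the lowercase and the uppercase shorthand classes into explicit Unicode property classes. It is not the actual prepare_pattern code: the function name `expand_unicode_classes`, the `POSITIVE` table, and the specific property classes chosen are assumptions made for illustration. It also ignores edge cases such as a literal `]` at the start of a character class.

```python
# Hypothetical expansions; the names and the exact property classes are
# assumptions for illustration, not taken from the re2 wrapper's source.
POSITIVE = {
    "d": r"\p{Nd}",
    "w": r"_\p{L}\p{N}",
    "s": r"\s\p{Z}",
}


def expand_unicode_classes(pattern):
    # Rewrite \d \w \s and \D \W \S into explicit Unicode property classes.
    out = []
    in_class = False  # True while we are inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            body = POSITIVE.get(nxt.lower())
            if body is None:
                out.append(ch + nxt)  # some other escape: copy it through
            elif nxt.islower():
                # \d, \w, \s: bare inside a class, bracketed outside one
                out.append(body if in_class else "[" + body + "]")
            elif not in_class:
                out.append("[^" + body + "]")  # \D, \W, \S outside a class
            else:
                # A negated shorthand inside [...] (as in '[^\W]') needs set
                # subtraction, which this sketch does not attempt.
                raise NotImplementedError("negated class inside [...]")
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)
```

Note that the pattern from the report, '[^\W]', places \W inside a character class, which is exactly the case this sketch declines to handle; a real fix would need to deal with it.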

turian · 2011-11-08T00:26:34Z

Please.

axiak · 2011-11-08T18:02:27Z

We had an issue with \W, \D and \S that itsadok just fixed, and I have pushed the fix out. However, I think there are still Unicode issues, as the groups in issue #4 don't match up quite right (I added that case as a test). Please pull the latest version and see if it works for you while I try to work out why the test is failing.

axiak pushed a commit that referenced this issue Nov 8, 2011

Fixed issue #5, support \W, \S and \D

61b6c48
