extracting an attribute with diacritical character at the end is cut-off #39
Comments
The positions 732 ... 736 indicate that the result has a length of 5 characters.
This looks like a problem with '\b'. Is it possible that Hyperscan's word boundary '\b' does not handle UTF-8 correctly, or that a required option is not set? See the following table for outputs; the input is always 'René':
Aha. I can of course use spaces, because I have a very rigid text format and I don't want to match things at the very beginning or end. So I can work around the issue. :-)
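To illustrate the suspicion about '\b': below is a minimal sketch using Python's standard `re` module, not Hyperscan. It only shows how an ASCII-only word boundary behaves when it sees the UTF-8 bytes of 'René'; whether Hyperscan does the same internally is an assumption. The helper name `first_word` is made up for this example.

```python
import re

def first_word(data):
    """Return the first \\b\\w+\\b match in a str or a bytes input."""
    # On bytes, Python's \w and \b are ASCII-only; on str they are Unicode-aware.
    pattern = rb"\b\w+\b" if isinstance(data, bytes) else r"\b\w+\b"
    m = re.search(pattern, data)
    return m.group(0) if m else None

text = "René"
raw = text.encode("utf-8")   # 'é' becomes the two bytes 0xC3 0xA9

print(len(text), len(raw))   # 4 characters, 5 bytes
print(first_word(text))      # René
print(first_word(raw))       # b'Ren'
```

In byte mode the two bytes of 'é' are not word characters, so '\b' fires right after the 'n' and the match stops there. Hyperscan does have the compile flags `HS_FLAG_UTF8` and `HS_FLAG_UCP`; whether they are set for this rule is not stated in the thread.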
Ah, not that simple, you get something like:
Do I see this correctly: Hyperscan doesn't know about subgroups in regular expressions? But this also means that I cannot specify boundaries for a token that are NOT extracted, as stated above (so the space or dot would be part of the word token). \b also gets extracted, but as it has no
In Perl I would write:
and extract group(1).
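As a rough sketch of the same capture-group idea in Python's `re` (again not Hyperscan, which reports match offsets and does not capture sub-expressions): the delimiters are part of the match, but only group 1 is extracted. The pattern and the name `extract_token` are hypothetical and are not the rule from this issue.

```python
import re

# Hypothetical rule: a word preceded by a space and followed by a space, dot or comma.
# The delimiters are matched but sit outside the capture group.
TOKEN = re.compile(r"[ ](\w+)[ .,]")

def extract_token(text):
    """Return only the captured word, without its delimiters."""
    m = TOKEN.search(text)
    return m.group(1) if m else None

print(extract_token("Dear René, welcome"))  # René
```

Because the pattern runs on a `str`, `\w` is Unicode-aware and the trailing 'é' stays inside the group.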
Maybe omitting the \b is the best option here, like this:
One idea would be to use a second regex library like PCRE if a Hyperscan regex matches and if
The $1, $2, $3 are placeholders used to form a new string.
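A minimal sketch of that two-stage idea, under several assumptions: the first stage is only a stand-in for the Hyperscan scan, Python's `re` plays the role of the second library instead of PCRE, and the names `prefilter`, `DETAIL`, `to_python_template` and `rewrite` are invented for this example.

```python
import re

def prefilter(text):
    # Stand-in for the fast first-stage scan (Hyperscan in this thread).
    return "name:" in text

# Second-stage pattern with capture groups (hypothetical rule).
DETAIL = re.compile(r"name: (\w+) (\w+)")

def to_python_template(template):
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... syntax.
    return re.sub(r"\$(\d+)", r"\\g<\1>", template)

def rewrite(text, template):
    """Run the second-stage regex only if the prefilter hits, then fill the template."""
    if not prefilter(text):
        return None
    m = DETAIL.search(text)
    if m is None:
        return None
    return m.expand(to_python_template(template))

print(rewrite("name: René Dupont.", "$2, $1"))  # Dupont, René
```

The second regex only runs on input that already passed the first stage, so its cost is paid per hit rather than per scanned byte.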
Now this bug report is getting loaded with too many things.
Added the pattern lexer option BYTECHAR, which forces a mapping to a virtual one-byte character set, as is already done to make edit-distance matching Unicode-capable. The option fixes the issue that the Hyperscan library seems to have with \b word boundaries in combination with UTF-8.
The lexer option BYTECHAR doesn't solve the problem.
The rule is:
Extracting a word such as 'René' gives:
On the other hand, if the diacritical character is in the middle or at the beginning:
works.