extracting an attribute with diacritical character at the end is cut-off #39

Open
andreasbaumann opened this issue May 14, 2018 · 11 comments

Comments

@andreasbaumann
Collaborator

The rule is:

WORD ^1            : /\b([\p{L}\d]+)\b/;
Citizen = any( ... );
CitizenWord = any( WORD "Staatsangehöriger", WORD "Staatsangehörige" );
Person             = sequence_imm( last = WORD, COMMA, first = WORD, COMMA, Citizen, CitizenWord, COMMA, WORD "in", wohnort = WORD, COMMA );

Extracting the word 'René' yields:

first [133..134, 0|732 .. 0|736] 'Ren'

On the other hand, if the diacritical character is at the beginning or in the middle of the word, extraction works:

wohnort [107..108, 0|565 .. 0|572] 'Zürich'

@andreasbaumann andreasbaumann changed the title extracting a attribute with diachritical at the end is cut-off extracting a attribute with diacritical character at the end is cut-off May 14, 2018
@andreasbaumann andreasbaumann changed the title extracting a attribute with diacritical character at the end is cut-off extracting an attribute with diacritical character at the end is cut-off May 14, 2018
@patrickfrey
Owner

patrickfrey commented May 14, 2018

The positions 732 ... 736 indicate that the result has a length of 5 characters.
So the result is calculated correctly, but the output (string 'Ren') is not.
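
For comparison with the reported positions, here is a small Python sketch that counts code points and UTF-8 bytes for the strings in this issue. It assumes the positions are byte offsets into UTF-8 text. The thread does not state this, nor does it say whether the end position is inclusive.

```python
# Compare character counts with UTF-8 byte counts for the strings in this issue.
samples = ["René", "Ren", "Zürich"]


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


for s in samples:
    print(f"{s!r}: {len(s)} code points, {utf8_len(s)} UTF-8 bytes")
```

Under that assumption, 'é' and 'ü' each take two bytes, so byte lengths and character counts differ for both 'René' and 'Zürich'.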

@patrickfrey
Owner

This looks like a problem with using '\b'. Is it possible that Hyperscan's word boundary '\b' does not process UTF-8 correctly, or that a required option exists that we have not set?

See the following table for the outputs; the input is always 'René':

Regex            Output
/\p{L}+/         René
/\S+/            René
/\w+/            Ren
/\b\p{L}+\b/     Ren
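
The same symptom can be reproduced outside Hyperscan. The sketch below uses Python's built-in re module, not Hyperscan, so it is only an analogy: it shows what happens when '\w' and '\b' are evaluated with ASCII-only semantics on text that contains 'é'.

```python
import re

text = "René"

# Unicode-aware (Python's default for str patterns): é counts as a word character.
unicode_match = re.search(r"\b\w+\b", text).group(0)

# ASCII-only \w and \b: é is treated as a non-word character,
# so a word boundary is seen between 'n' and 'é'.
ascii_match = re.search(r"\b\w+\b", text, flags=re.ASCII).group(0)

print(unicode_match)
print(ascii_match)
```

If Hyperscan evaluates '\b' and '\w' in a comparable ASCII-only way, that would explain the 'Ren' rows in the table, but this is an assumption and not something the thread confirms.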

@andreasbaumann
Collaborator Author

andreasbaumann commented May 14, 2018

Aha. I can of course use spaces, because I have a very rigid text format and I don't want to match things at the very beginning or end. So I can work around the issue. :-)

@andreasbaumann
Collaborator Author

Ah, it's not that simple. You end up with something like:

WORD ^1            : /^([\p{L}\d]+)[\s\.,!\?\:]/;
WORD ^1            : /[\s\.,!\?\:]([\p{L}\d]+)[\s\.,!\?\:]$/;
WORD ^1            : /[\s\.,!\?\:]([\p{L}\d]+)[\s\.,!\?\:]/;

@andreasbaumann
Collaborator Author

andreasbaumann commented May 15, 2018

Do I see this correctly: Hyperscan doesn't know about subgroups in regular expressions,
so it provides either only the end of the match or the begin and end of the match?

But this also means that I cannot formulate boundaries for a token that are NOT extracted, as in the patterns above (so the space or dot would become part of the word token). \b also gets extracted, but as it has no
character representation, it MOSTLY works.

@andreasbaumann
Collaborator Author

In Perl I would write:

WORD ^1            : /(?:^|[ ;.,!?:_-])([\p{L}\d]+)(?:[\s.,!?:_-]|$)/;

and extract group(1).
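
As an illustration of the difference between the whole match and group(1), here is a rough Python equivalent. It is a sketch with two assumptions: Python's built-in re module has no \p{L}, so [^\W_] (word characters minus the underscore) stands in for [\p{L}\d], and the '-' in the delimiter classes is placed last so that it is a literal hyphen and not a range.

```python
import re

# [^\W_] (word characters minus underscore) stands in for [\p{L}\d],
# because Python's built-in re module has no \p{L}.
WORD = re.compile(r"(?:^|[ ;.,!?:_-])([^\W_]+)(?:[\s.,!?:_-]|$)")

m = WORD.search(", René, ")
whole = m.group(0)  # full match, delimiters included
word = m.group(1)   # only the captured token
print(repr(whole), repr(word), m.span(1))
```

The full match carries the leading space and the trailing comma, which is what an engine without subgroups would report. Only the capture group isolates the token itself.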

@andreasbaumann
Collaborator Author

Maybe omitting the \b is the best option here, like this:

WORD ^1            : /[\p{L}\d]+/;
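
A quick Python check of the boundary-free variant, again with [^\W_] standing in for [\p{L}\d]. The sample line is made up for illustration and only mimics the shape of the Person rule. How Hyperscan reports matches for such a pattern may differ from a backtracking engine like Python's re.

```python
import re

# Hypothetical input line in the style of the Person rule.
line = "Dupont, René, in Zürich,"

# Maximal runs of letters and digits, with no \b involved.
words = re.findall(r"[^\W_]+", line)
print(words)
```

In this engine the greedy quantifier already yields whole words, including those ending in a diacritical character.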

@andreasbaumann
Collaborator Author

andreasbaumann commented May 15, 2018

One idea would be to run a second regex library such as PCRE as a kind of post-filter whenever a Hyperscan regex matches and
strusPatternMatcher detects subgroups in the regex. The syntax is not quite clear, though:
what should the value of a token like WORD in the example above be if there are multiple capturing groups? For example:

PHONE : /(\d{2,3}) (\d{2,3})\-(\d{2,3})/; ["$1_$2_$3"]

Here $1, $2 and $3 are placeholders used to form a new string.
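
A minimal sketch of this post-filter idea, with Python's re module standing in for PCRE. The helper name format_match and the sample phone number are hypothetical; nothing here reflects how strusPatternMatcher would actually implement it.

```python
import re

PHONE = re.compile(r"(\d{2,3}) (\d{2,3})-(\d{2,3})")


def format_match(match: re.Match, template: str) -> str:
    """Fill $1, $2, ... in template with the match's capture groups."""
    # Translate $N into Python's \g<N> group reference syntax.
    py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
    return match.expand(py_template)


# Hypothetical input; the first-stage matcher is assumed to have flagged it.
m = PHONE.search("Tel. 044 123-45")
result = format_match(m, "$1_$2_$3")
print(result)
```

The first-stage matcher would only need to report that the pattern matched and where. The second stage then re-runs the expression on that span to obtain the groups.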

@patrickfrey
Owner

This bug report is getting loaded with too many things.
Subexpression matching is possible, but you can select only one element.
Forming the result from a pattern is a good idea for the lexer and for the matcher as well.

@patrickfrey
Owner

I added the pattern lexer option BYTECHAR, which forces the lexer to map the input to a virtual one-byte character set, as is already done to make edit distance matching Unicode capable. The option addresses the issue in the Hyperscan library, which seems to have a problem with \b word boundaries in combination with UTF-8.
I will report the issue to Hyperscan.

@patrickfrey
Owner

The lexer option BYTECHAR does not solve the problem.
The example "français" is mapped to "fran\347ais", and "\b" then splits it into "fran" and "ais".
I'm running out of ideas.
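
The split can be reproduced in Python as a rough analogy. The sketch assumes Latin-1 as the one-byte character set, because 'ç' maps to byte \347 there. The actual BYTECHAR mapping is not described in this thread.

```python
import re

# Map the text to a one-byte-per-character encoding (Latin-1 as a stand-in).
mapped = "français".encode("latin-1")

# For bytes patterns, Python's \w and \b are ASCII-only by default,
# so byte 0xE7 (octal \347) counts as a non-word character.
pieces = re.findall(rb"\b\w+\b", mapped)
print(mapped, pieces)
```

A one-byte mapping removes the multi-byte sequences, but as long as '\b' only treats ASCII letters, digits and the underscore as word characters, the mapped byte still acts as a boundary.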
