extracting an attribute with diacritical character at the end is cut-off #39

andreasbaumann · 2018-05-14T09:35:37Z

The rule is:

WORD ^1            : /\b([\p{L}\d]+)\b/;
Citizen = any( ... );
CitizenWord = any( WORD "Staatsangehöriger", WORD "Staatsangehörige" );
Person             = sequence_imm( last = WORD, COMMA, first = WORD, COMMA, Citizen, CitizenWord, COMMA, WORD "in", wohnort = WORD, COMMA );

Extracting the word 'René' yields:

first [133..134, 0|732 .. 0|736] 'Ren'

On the other hand, if the diacritical character is at the beginning or in the middle of the word, the extraction works:

wohnort [107..108, 0|565 .. 0|572] 'Zürich'

patrickfrey · 2018-05-14T10:59:44Z

The positions 732 ... 736 indicate that the result has a length of 5 characters.
So the result is calculated correctly, but the output (string 'Ren') is not.

patrickfrey · 2018-05-14T14:28:43Z

This looks like a problem with '\b'. Is it possible that word boundaries ('\b') in Hyperscan do not handle UTF-8 correctly, or that a required option is not set?

See the following table of outputs; the input is always 'René':

Regex            Output
/\p{L}+/         René
/\S+/            René
/\w+/            Ren
/\b\p{L}+\b/     Ren
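
The pattern in this table is what one would expect if '\w' and '\b' were evaluated against ASCII word characters only. Whether that is what happens inside Hyperscan is not established in this thread. The following is only a small sketch that reproduces the same symptom with Python's `re` module (not Hyperscan), by matching byte patterns against the UTF-8 encoding of 'René'. Python's `re` has no `\p{L}`, so those rows are not reproduced.

```python
import re

text = "Ren\u00e9"            # 'René' with a precomposed é (U+00E9)
raw = text.encode("utf-8")    # b'Ren\xc3\xa9': é occupies two bytes

# Byte patterns: \w and \b only know ASCII word characters,
# so the two bytes of é count as non-word bytes.
bytes_nonspace = re.findall(rb"\S+", raw)       # [b'Ren\xc3\xa9']
bytes_word = re.findall(rb"\w+", raw)           # [b'Ren']
bytes_bounded = re.findall(rb"\b\w+\b", raw)    # [b'Ren']

# String patterns: \w and \b are Unicode-aware, é is a word character.
unicode_word = re.findall(r"\w+", text)         # ['René']
unicode_bounded = re.findall(r"\b\w+\b", text)  # ['René']
```

In the byte-oriented case the word boundary falls between 'n' and the first byte of 'é', which matches the truncated 'Ren' seen above.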

andreasbaumann · 2018-05-14T15:03:21Z

Aha. I can of course use spaces, because I have a very rigid text format and I don't want to match things at the very beginning or end. So I can work around the issue. :-)

andreasbaumann · 2018-05-15T06:20:03Z

Ah, it is not that simple. You end up with something like:

WORD ^1            : /^([\p{L}\d]+)[\s\.,!\?\:]/;
WORD ^1            : /[\s\.,!\?\:]([\p{L}\d]+)[\s\.,!\?\:]$/;
WORD ^1            : /[\s\.,!\?\:]([\p{L}\d]+)[\s\.,!\?\:]/;

andreasbaumann · 2018-05-15T06:26:18Z

Do I understand this correctly: Hyperscan doesn't know about subgroups in regular expressions, so it reports either only the end of the match or the begin and end of the match?

But this also means that I cannot formulate boundaries for a token that are NOT to be extracted, as in the rules above (so the space or dot would become part of the word token). \b also gets extracted, but as it has no character representation, it MOSTLY works.

andreasbaumann · 2018-05-15T06:27:16Z

In Perl I would write:

WORD ^1            : /(?:^|[ ;.,!?:_-])([\p{L}\d]+)(?:[\s.,!?:_-]|$)/;

and extract group(1).
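
As an illustration of the "delimiters outside, word in group 1" idea, here is a minimal sketch in Python rather than Perl. Several things in it are assumptions and not part of strus or Hyperscan: the helper name `extract_words` is hypothetical, `[^\W_]` stands in for `[\p{L}\d]` because Python's `re` has no `\p{L}` (it is only an approximation), and the trailing delimiter is checked with a lookahead instead of being consumed, so that one delimiter can separate two adjacent words. The '-' is placed last in the character class so that it is a literal hyphen and not part of a range.

```python
import re

# Delimiter set modelled on the rule above; '-' is last so it stays literal.
DELIMS = r"\s;.,!?:_-"

# Group 1 is the word. [^\W_] approximates [\p{L}\d] (letters and digits).
WORD = re.compile(rf"(?:^|[{DELIMS}])([^\W_]+)(?=[{DELIMS}]|$)")

def extract_words(text):
    """Return the contents of group 1 for every match, without delimiters."""
    return WORD.findall(text)

names = extract_words("Frey, Ren\u00e9, Z\u00fcrich")
# names == ['Frey', 'René', 'Zürich']
```

Because the delimiters sit outside the capturing group, they constrain where a word may start and end without becoming part of the extracted token.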

andreasbaumann · 2018-05-15T06:37:34Z

Maybe omitting the \b is the best option here, like this:

WORD ^1            : /[\p{L}\d]+/;

andreasbaumann · 2018-05-15T06:42:50Z

One idea would be to use a second regex library like PCRE as a kind of post-filter: it would run if a Hyperscan regex matches and strusPatternMatcher detects subgroups in the regex. The syntax is not quite clear, though: what should WORD in the example above be if there are several capturing groups? For example:

PHONE : /(\d{2,3}) (\d{2,3})\-(\d{2,3})/; ["$1_$2_$3"]

Here $1, $2 and $3 are placeholders used to form a new string.
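
A minimal sketch of such a post-filtering step, using Python's `re` module as a stand-in for the second regex library. This is an assumption made for illustration only, not how strusPatternMatcher works: the helper `fill_template` is hypothetical, and the "$1"-style placeholders are translated into Python's `\g<1>` form before `Match.expand` is applied.

```python
import re

# The PHONE expression from above, with three capturing groups.
PHONE = re.compile(r"(\d{2,3}) (\d{2,3})-(\d{2,3})")

def fill_template(match, template):
    """Fill a "$1_$2"-style template with the groups of a match."""
    # Rewrite each "$n" placeholder as "\g<n>", which Match.expand understands.
    py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
    return match.expand(py_template)

# In the proposed scheme this would only run on a span Hyperscan reported.
m = PHONE.search("Tel. 044 123-45")
formatted = fill_template(m, "$1_$2_$3") if m else None
# formatted == '044_123_45'
```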

patrickfrey · 2018-05-15T08:48:51Z

This bug report is getting overloaded with too many topics.
Subexpression matching is possible, but you can select only one element.
Forming result patterns is a good idea, for the lexer as well as for the matcher.

patrickfrey · 2018-05-15T08:51:38Z

I added a pattern lexer option BYTECHAR that forces a mapping to a virtual one-byte character set, as is already done to make edit distance matching Unicode-capable. The option fixes the issue that the Hyperscan library seems to have with '\b' word boundaries in combination with UTF-8.
I will report the issue to Hyperscan.

patrickfrey · 2018-05-15T10:42:07Z

The lexer option BYTECHAR doesn't solve the problem.
The example "français" is mapped to "fran\347ais", and "\b" splits it into "fran" and "ais".
I'm running out of ideas.
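
The split can be reproduced outside of strus with a byte-oriented regex. The sketch below is not the BYTECHAR implementation: it uses Latin-1 as a stand-in for the virtual one-byte character set (an assumption; it happens to map 'ç' to the same byte, octal \347) and Python's `re` in bytes mode, where '\w' and '\b' only recognise ASCII word characters.

```python
import re

word = "fran\u00e7ais"             # 'français'
mapped = word.encode("latin-1")    # b'fran\xe7ais'; 0xE7 is octal \347

# In bytes mode the byte 0xE7 is not a word character, so \b finds
# a boundary on both sides of it and the word falls apart.
pieces = re.findall(rb"\b\w+\b", mapped)
# pieces == [b'fran', b'ais']
```

Mapping to one byte per character removes the multi-byte encoding, but the mapped byte is still outside the ASCII word class, so the boundary problem remains.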

andreasbaumann changed the title ~~extracting a attribute with diachritical at the end is cut-off~~ extracting a attribute with diacritical character at the end is cut-off May 14, 2018

andreasbaumann changed the title ~~extracting a attribute with diacritical character at the end is cut-off~~ extracting an attribute with diacritical character at the end is cut-off May 14, 2018
