Extract matching sequence of tokens #41

andreasbaumann · 2018-05-16T09:17:41Z

Sometimes you want to extract the original text of a matching sequence, for instance:

WORD ^1       : /([\p{L}\d]+)/;
SIGN              : /[.,!?:;]/;

Rule = sequence_imm( c[] = any( WORD | SIGN ) {3, 8} );

(the rule syntax is a complete invention of mine :-) ).

The matching text is:

Uhren, Porzellan GmbH, St. Gallen

I would like to extract the text with dots and commas. So far I'm only able to extract
the single tokens:

XX [XXX] : 1 WORD Uhren
XX [XXX] : 2 SIGN ,
XX [XXX] : 1 WORD Porzellan
XX [XXX] : 1 WORD GmbH
XX [XXX] : 2 SIGN ,
XX [XXX] : 1 WORD St
XX [XXX] : 2 SIGN .
XX [XXX] : 1 WORD Gallen

So, it's not easy to reconstruct the text from the c[] array of tokens.
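One possible workaround, sketched here in Python rather than in the pattern language, is to keep character offsets for every token and slice the original string from the start of the first matched token to the end of the last one. The `tokenize` and `span_text` helpers are hypothetical names, not part of any existing API, and `[^\W_]+` only approximates the `[\p{L}\d]+` class because Python's `re` module has no `\p{L}`.

```python
import re

# Approximation of the WORD and SIGN token definitions from the example above.
TOKEN_RE = re.compile(r"(?P<WORD>[^\W_]+)|(?P<SIGN>[.,!?:;])")

def tokenize(text):
    """Return (type, value, start, end) tuples with character offsets."""
    return [(m.lastgroup, m.group(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]

def span_text(text, tokens):
    """Slice the source text covered by a run of matched tokens."""
    if not tokens:
        return ""
    # Start offset of the first token up to the end offset of the last one.
    return text[tokens[0][2]:tokens[-1][3]]

text = "Uhren, Porzellan GmbH, St. Gallen"
tokens = tokenize(text)
print(span_text(text, tokens))
```

Because the slice is taken from the source string, the original spacing and punctuation survive without having to be rebuilt from the individual token values.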


patrickfrey · 2018-07-02T16:29:47Z

Sequences can only be built as left regular expressions, e.g.:

SEQ = any( pref = WORD);
SEQ = sequence_imm( pref=SEQ, next=WORD ) ["{pref} {next}"];

This definition creates a SEQ pattern for every subsequence.
Restriction to the longest sequence is only possible with the matcher option

%MATCHER exclusive

This is not enough, but it is not yet clear what has to be implemented and what does not.
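To illustrate why the left-recursive definition above produces so many results, here is a small Python sketch that mimics it on a plain list of words. It is only a model of the behaviour described in this comment, not the matcher's implementation, and `left_recursive_matches` is a hypothetical name.

```python
def left_recursive_matches(words):
    """Mimic SEQ = WORD | SEQ WORD: collect every contiguous run of words."""
    matches = []
    for start in range(len(words)):
        seq = words[start]          # base case: SEQ = any( pref = WORD )
        matches.append(seq)
        for nxt in words[start + 1:]:
            seq = f"{seq} {nxt}"    # recursive step with format "{pref} {next}"
            matches.append(seq)
    return matches

words = ["Uhren", "Porzellan", "GmbH"]
all_matches = left_recursive_matches(words)

# Keeping only the longest run is the effect attributed to the exclusive option.
longest = max(all_matches, key=lambda s: len(s.split()))
print(len(all_matches), longest)
```

Note that the format string joins the parts with a single space, so this approach rebuilds a normalized string and does not recover the original spacing around punctuation.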

patrickfrey mentioned this issue Jul 2, 2018

support for sequences #37

Open
