Extract matching sequence of tokens #41

Open
andreasbaumann opened this issue May 16, 2018 · 1 comment

@andreasbaumann
Collaborator

andreasbaumann commented May 16, 2018

Sometimes you want to extract the original text of a matching sequence, for instance:

WORD ^1       : /([\p{L}\d]+)/;
SIGN              : /[.,!?:;]/;

Rule = sequence_imm( c[] = any( WORD | SIGN ) {3, 8} );

(the rule syntax is a complete invention of mine :-) ).

The matching text is:

Uhren, Porzellan GmbH, St. Gallen

I would like to extract the text including the dots and commas. So far I'm only able to extract
the individual tokens:

XX [XXX] : 1 WORD Uhren
XX [XXX] : 2 SIGN ,
XX [XXX] : 1 WORD Porzellan
XX [XXX] : 1 WORD GmbH
XX [XXX] : 2 SIGN ,
XX [XXX] : 1 WORD St
XX [XXX] : 2 SIGN .
XX [XXX] : 1 WORD Gallen

So, it's not easy to reconstruct the text from the c[] array of tokens.
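
To illustrate what is being asked for, here is a minimal Python sketch (not strus code). It assumes each token carries its start and end offset in the source text, so the original substring can be sliced out instead of being re-joined from the tokens. The names `TOKEN_RE`, `tokenize` and `original_span` are hypothetical, and `[^\W_]+` is only an approximation of the `[\p{L}\d]+` lexeme, since Python's `re` module has no `\p{L}`.

```python
import re

# Hypothetical stand-ins for the WORD and SIGN lexemes from the issue.
# [^\W_]+ approximates "letters or digits" because re lacks \p{L}.
TOKEN_RE = re.compile(r"(?P<WORD>[^\W_]+)|(?P<SIGN>[.,!?:;])")


def tokenize(text):
    """Return (type, value, start, end) for every WORD or SIGN in text."""
    return [(m.lastgroup, m.group(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def original_span(text, tokens):
    """Slice the source from the first token's start to the last token's end."""
    if not tokens:
        return ""
    return text[tokens[0][2]:tokens[-1][3]]


text = "Uhren, Porzellan GmbH, St. Gallen"
tokens = tokenize(text)

# The whole matched sequence, with its original spacing and punctuation.
matched = original_span(text, tokens)

# Any sub-sequence of tokens can be recovered the same way.
partial = original_span(text, tokens[2:5])
```

With offsets available, the spacing between tokens (none before a comma, one after) comes from the source itself and does not have to be guessed.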

@patrickfrey
Owner

Sequences can only be built as left regular expressions, e.g.:

SEQ = any( pref = WORD);
SEQ = sequence_imm( pref=SEQ, next=WORD ) ["{pref} {next}"];

This definition creates a SEQ pattern for every subsequence.
Restricting the result to the longest sequence is only possible with the matcher option

%MATCHER exclusive

This is not enough, but it is not yet clear what has to be implemented and what does not.
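
The effect of the left-recursive definition can be sketched in plain Python (again, not strus code). The sketch assumes that SEQ matches every contiguous run of WORD tokens and that the result is formatted as "{pref} {next}". The `keep_longest` filter is only an assumed stand-in for what restricting to the longest sequence means; the real behaviour of `%MATCHER exclusive` may differ. All function names are hypothetical.

```python
def left_recursive_sequences(words):
    """Mimic SEQ = WORD | SEQ WORD: every contiguous run of words matches."""
    matches = []
    for start in range(len(words)):
        pref = words[start]                    # base case: SEQ = any(pref = WORD)
        matches.append((start, start + 1, pref))
        for end in range(start + 1, len(words)):
            pref = f"{pref} {words[end]}"      # result format "{pref} {next}"
            matches.append((start, end + 1, pref))
    return matches


def keep_longest(matches):
    """Drop every match whose span lies inside a strictly longer match."""
    def covered(a, b):
        return b[0] <= a[0] and a[1] <= b[1] and (b[1] - b[0]) > (a[1] - a[0])
    return [m for m in matches if not any(covered(m, o) for o in matches)]


words = ["Uhren", "Porzellan", "GmbH"]
all_matches = left_recursive_sequences(words)   # 6 matches for 3 words
longest = keep_longest(all_matches)             # only the full run remains
```

In this sketch the formatted result joins tokens with a single space, so it does not reproduce the original spacing around punctuation.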
