An ANTLR 3 grammar that generates a parser able to parse PCRE (Perl compatible regular expressions) and produce an abstract syntax tree (AST) of such expressions.
For an ANTLR 4 grammar, have a look here: https://github.com/bkiers/pcre-parser
For a JavaScript version, check out the js branch.
To get the library, check out the project and run mvn clean install, or download the jar.
You can also try the parser online: pcreparser.appspot.com
The main class of this library is pcreparser.PCRE. Below are some examples of supported functionality.
source:
PCRE pcre = new PCRE("((.)\\1+ (?<YEAR>(?:19|20)\\d{2})) [^]-x]");
System.out.println(pcre.getGroupCount());
output:
3
Note that the named capture group, (?<YEAR>(?:19|20)\\d{2}), also counts. Below is the list of groups:
((.)\\1+ (?<YEAR>(?:19|20)\\d{2}))
(.)
(?<YEAR>(?:19|20)\\d{2})
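Groups are numbered by the position of their opening parenthesis, which is why the list above starts with the outermost group. As a cross-check that does not use this library at all, the sketch below runs the capturing part of the same pattern through the JDK's own java.util.regex engine. The class name GroupNumbering and the input "aaa 1999" are made up for illustration, and the trailing character class is left out to keep the example small.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: uses java.util.regex from the JDK, not pcreparser.PCRE.
class GroupNumbering {

    // The capturing part of the README pattern, without the trailing " [^]-x]".
    static final Pattern PATTERN =
            Pattern.compile("((.)\\1+ (?<YEAR>(?:19|20)\\d{2}))");

    // Returns a matcher that has matched the whole input, or throws if it does not match.
    static Matcher match(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = match("aaa 1999");
        System.out.println(m.groupCount());   // capturing groups, not counting group 0
        System.out.println(m.group(1));       // outermost group
        System.out.println(m.group(2));       // (.)
        System.out.println(m.group(3));       // the named group, addressed by number
        System.out.println(m.group("YEAR"));  // the named group, addressed by name
    }
}
```

Running it prints 3, followed by aaa 1999, a, 1999 and 1999, which matches the numbering of the three groups listed above.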
source:
PCRE pcre = new PCRE("((.)\\1+ (?<YEAR>(?:19|20)\\d{2})) [^]-x]");
System.out.println(pcre.getNamedGroupCount());
output:
1
source:
PCRE pcre = new PCRE("((.)\\1+ (?<YEAR>(?:19|20)\\d{2})) [^]-x]");
System.out.println(pcre.toStringASCII()); // equivalent to: pcre.toStringASCII(0)
output:
'- ALTERNATIVE
   |- ELEMENT
   |  '- CAPTURING_GROUP
   |     '- ALTERNATIVE
   |        |- ELEMENT
   |        |  '- CAPTURING_GROUP
   |        |     '- ALTERNATIVE
   |        |        '- ELEMENT
   |        |           '- ANY
   |        |- ELEMENT
   |        |  |- NUMBERED_BACKREFERENCE
   |        |  |  '- NUMBER='1'
   |        |  '- QUANTIFIER
   |        |     |- NUMBER='1'
   |        |     |- NUMBER='2147483647'
   |        |     '- GREEDY
   |        |- ELEMENT
   |        |  '- LITERAL=' '
   |        '- ELEMENT
   |           '- NAMED_CAPTURING_GROUP_PERL
   |              |- NAME='YEAR'
   |              '- ALTERNATIVE
   |                 |- ELEMENT
   |                 |  '- NON_CAPTURING_GROUP
   |                 |     '- OR
   |                 |        |- ALTERNATIVE
   |                 |        |  |- ELEMENT
   |                 |        |  |  '- LITERAL='1'
   |                 |        |  '- ELEMENT
   |                 |        |     '- LITERAL='9'
   |                 |        '- ALTERNATIVE
   |                 |           |- ELEMENT
   |                 |           |  '- LITERAL='2'
   |                 |           '- ELEMENT
   |                 |              '- LITERAL='0'
   |                 '- ELEMENT
   |                    |- DecimalDigit='\d'
   |                    '- QUANTIFIER
   |                       |- NUMBER='2'
   |                       |- NUMBER='2'
   |                       '- GREEDY
   |- ELEMENT
   |  '- LITERAL=' '
   '- ELEMENT
      '- NEGATED_CHARACTER_CLASS
         '- RANGE
            |- LITERAL=']'
            '- LITERAL='x'
To print only a specific numbered or named group:
source:
PCRE pcre = new PCRE("((.)\\1+ (?<YEAR>(?:19|20)\\d{2})) [^]-x]");
System.out.println(pcre.toStringASCII(2));
output:
'- CAPTURING_GROUP
   '- ALTERNATIVE
      '- ELEMENT
         '- ANY
source:
PCRE pcre = new PCRE("((.)\\1+ (?<YEAR>(?:19|20)\\d{2})) [^]-x]");
System.out.println(pcre.toStringASCII("YEAR"));
output:
'- NAMED_CAPTURING_GROUP_PERL
   |- NAME='YEAR'
   '- ALTERNATIVE
      |- ELEMENT
      |  '- NON_CAPTURING_GROUP
      |     '- OR
      |        |- ALTERNATIVE
      |        |  |- ELEMENT
      |        |  |  '- LITERAL='1'
      |        |  '- ELEMENT
      |        |     '- LITERAL='9'
      |        '- ALTERNATIVE
      |           |- ELEMENT
      |           |  '- LITERAL='2'
      |           '- ELEMENT
      |              '- LITERAL='0'
      '- ELEMENT
         |- DecimalDigit='\d'
         '- QUANTIFIER
            |- NUMBER='2'
            |- NUMBER='2'
            '- GREEDY
Besides the toStringASCII() method demonstrated above, there are other methods that display the AST:
PCRE#toStringDOT(): creates a DOT representation of group 0
PCRE#toStringDOT(int n): creates a DOT representation of group n
PCRE#toStringDOT(String s): creates a DOT representation of the named group s
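This README does not show what the DOT output looks like, so the following is only a sketch of how a tree such as the one printed for group 2 could be turned into Graphviz DOT text. The DotSketch class, its nested Node type, the graph name ast and the n0, n1, ... node ids are all assumptions made for illustration; none of them come from this library, and the real toStringDOT() output may be laid out differently.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a stand-in tree and DOT writer, not the library's implementation.
class DotSketch {

    // Hypothetical minimal tree node: a label plus child nodes.
    static final class Node {
        final String text;
        final List<Node> children;

        Node(String text, Node... children) {
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    private final StringBuilder out = new StringBuilder();
    private int nextId = 0;

    // Renders the tree rooted at 'root' as a DOT digraph.
    static String toDot(Node root) {
        DotSketch sketch = new DotSketch();
        sketch.out.append("digraph ast {\n");
        sketch.emit(root);
        sketch.out.append("}\n");
        return sketch.out.toString();
    }

    // Writes one node declaration, then recurses and writes an edge to each child.
    private int emit(Node node) {
        int id = nextId++;
        out.append("  n").append(id)
           .append(" [label=\"").append(escape(node.text)).append("\"];\n");
        for (Node child : node.children) {
            int childId = emit(child);
            out.append("  n").append(id).append(" -> n").append(childId).append(";\n");
        }
        return id;
    }

    // Escapes backslashes and double quotes so labels such as \d stay valid DOT strings.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Node group2 = new Node("CAPTURING_GROUP",
                new Node("ALTERNATIVE",
                        new Node("ELEMENT",
                                new Node("ANY"))));
        System.out.print(toDot(group2));
    }
}
```

The printed text can be saved to a file and rendered with Graphviz, for example with dot -Tpng.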
To get the actual AST from the pattern, use one of the following methods:
PCRE#getCommonTree(): get the AST of group 0
PCRE#getCommonTree(int n): get the AST of group n
PCRE#getCommonTree(String s): get the AST of the named group s
Each getCommonTree method returns a CommonTree that has the following attributes:
CommonTree#getChildren(): List: a java.util.List of all child nodes/ASTs
CommonTree#getType(): int: the token type of the AST (token types can be found as static ints in PCRELexer, once generated)
CommonTree#getText(): String: the text the token associated with this node matched during parsing
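The three accessors above are enough to walk an AST recursively. The sketch below collects the text of every node of a given type, using a hypothetical Node class that mirrors getChildren(), getType() and getText() so that it runs without the ANTLR runtime on the classpath. The token type values are made up (the real ones are the static ints in the generated PCRELexer), and the null check reflects the assumption that ANTLR 3 returns null rather than an empty list for a leaf, which is worth verifying against the runtime you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: Node is a stand-in for org.antlr.runtime.tree.CommonTree.
class AstWalkSketch {

    // Made-up token type values; the real ones live in the generated PCRELexer.
    static final int ALTERNATIVE = 1;
    static final int ELEMENT = 2;
    static final int LITERAL = 3;

    // Hypothetical node exposing the same three accessors as CommonTree.
    static final class Node {
        private final int type;
        private final String text;
        private final List<Node> children; // null for a leaf (assumed to mirror ANTLR 3)

        Node(int type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = children.length == 0 ? null : Arrays.asList(children);
        }

        int getType() { return type; }
        String getText() { return text; }
        List<Node> getChildren() { return children; }
    }

    // Returns the text of every node of the given type, in depth-first order.
    static List<String> textsOfType(Node root, int type) {
        List<String> found = new ArrayList<>();
        collect(root, type, found);
        return found;
    }

    private static void collect(Node node, int type, List<String> found) {
        if (node.getType() == type) {
            found.add(node.getText());
        }
        List<Node> children = node.getChildren();
        if (children == null) {
            return; // leaf node
        }
        for (Node child : children) {
            collect(child, type, found);
        }
    }

    public static void main(String[] args) {
        // Hand-built equivalent of the "19" alternative inside the YEAR group.
        Node nineteen = new Node(ALTERNATIVE, "ALTERNATIVE",
                new Node(ELEMENT, "ELEMENT", new Node(LITERAL, "1")),
                new Node(ELEMENT, "ELEMENT", new Node(LITERAL, "9")));
        System.out.println(textsOfType(nineteen, LITERAL)); // prints [1, 9]
    }
}
```

With the real library, the same traversal would presumably start from the tree returned by one of the getCommonTree methods and compare getType() against the constants in PCRELexer.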