An embedded domain specific language for expressing parsers as extended parsing expression grammars (PEGs) in C++17
- Natural syntax more akin to external parser generator languages
- Separation of syntatic and lexical rules, with customizable implicit whitespace skipping
- Direct and indirect left recursion with precedence levels to disambiguate subexpressions with mixed left/right recursion
- Traditional PEG syntax has been extended to support attribute grammars
- Cut operator to commit to currently matched parse prefix and prune all backtrack entries
- Deferred evaluation of semantic actions, ensuring actions do not execute on failed branches or invalid input
- Generated parsers are compiled to special-purpose bytecode and executed in a virtual parsing machine
- UTF-8 text parsing with complete Level 1 and partial Level 2 support of the UTS #18 Unicode Regular Expressions technical standard
- Automatic line and column tracking with customizable tab width and alignment
- Uses expression template functors to implement the rules of the domain specific language
- Header only library using C++17 language and library features
- Relatively small with the intent of parser core to remain under 1500 lines of terse code
It is based on research introduced in the following papers:
Bryan Ford, Parsing expression grammars: a recognition-based syntactic foundation, Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, p.111-122, January 2004
Sérgio Medeiros et. al, A parsing machine for PEGs, Proceedings of the 2008 symposium on Dynamic Languages, p.1-12, July 2008
Kota Mizushima et. al, Packrat parsers can handle practical grammars in mostly constant space, Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, p.29-36, June 2010
Sérgio Medeiros et. al, Left recursion in Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.177-190, December 2014
Leonardo Reis et. al, The formalization and implementation of Adaptable Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.191-210, December 2014
Sérgio Medeiros et. al, A parsing machine for parsing expression grammars with labeled failures, Proceedings of the 31st Annual ACM symposium on Applied Computing, p.1960-1967, April 2016
As a header only library, lug itself does not need to be built.
Simply ensure the lug
header directory is in your include path and you're good to go.
As a baseline, the following compiler versions are known to work with lug.
Compiler | Language Mode |
---|---|
Clang 5.0.0 (September 2017) | -std=c++17 or -std=gnu++17 |
GCC 7.1.0 (May 2017) | -std=c++17 or -std=gnu++17 |
Microsoft Visual C++ 2017 15.5 (December 2017) | Platform Toolset: Visual Studio 2017 Toolset (v141), Language Standard: ISO C++17 Standard (/std:c++17) |
To build the sample programs and unit tests, a makefile is provided for Linux and BSD platforms and a Visual Studio solution is available for use on Windows.
Operator | Syntax |
---|---|
Sequence | e1 > e2 |
Ordered Choice | e1 | e2 |
Zero-or-More | *e |
One-or-More | +e |
Optional | ~e |
Positive Lookahead | &e |
Negative Lookahead | !e |
Terminal | Description |
---|---|
chr(c) |
Matches the UTF-8, UTF-16, or UTF-32 character c |
chr(c1, c2) |
Matches characters in the UTF-8, UTF-16, or UTF-32 interval [c1-c2] |
str(s) |
Matches the sequence of characters in a string |
bre(s) |
POSIX Basic Regular Expression (BRE) |
any |
Matches any single character |
any(flags) |
Matches a character exhibiting any of the character properties |
all(flags) |
Matches a character with all of the character properties |
none(flags) |
Matches a character with none of the character properties |
eps |
Matches the empty string |
eoi |
Matches the end of the input sequence |
eol |
Matches a Unicode line-ending |
nop |
No operation, does not emit any instructions |
cut |
Emits a cut operation into the stream of semantic actions without matching |
Literal | Name | Description |
---|---|---|
_cx |
Character Expression | Matches the UTF-8, UTF-16, or UTF-32 character literal |
_sx |
String Expression | Matches the sequence of characters in a string literal |
_rx |
Regular Expression | POSIX Basic Regular Expression (BRE) |
_icx |
Case Insensitive Character Expression | Same as _cx but case insensitive |
_isx |
Case Insensitive String Expression | Same as _sx but case insensitive |
_irx |
Case Insensitive Regular Expression | Same as _rx but case insensitive |
_scx |
Case Sensitive Character Expression | Same as _cx but case sensitive |
_ssx |
Case Sensitive String Expression | Same as _sx but case sensitive |
_srx |
Case Sensitive Regular Expression | Same as _rx but case sensitive |
- parser error recovery
- add an interactive processing mode flag to input sources?
- handle exceptions thrown from semantic actions in semantics::accept?
- feature: symbol tables and parsing conditions
- feature: Adams-Nestra grammars and whitespace alignment
- feature: syntax to specify number range of allowed iteration
- optimization: tail recursion
- optimization: reduce number of false-positive left-recursive calls even further by lazily evaluating rule mandate
- optimization: additional instructions (test_char, test_any, test_range, test_class)
- more samples, testing, and bug fixing
- increase compiler warning level and fix any issues
- documentation