The contents of a format
block in a Parsley specification are
written in the Parsley grammar sub-language. This block contains a
sequence of definitions of the non-terminals in the grammar. Each
definition is separated from the subsequent definition by a ;;
marker. The last definition in the sequence should be terminated by
the terminating }
of the format
block.
An attribute system is used to annotate the non-terminals in the grammar with information that is semantically relevant in the format. The inherited attributes of a non-terminal specify the contextual information that can be used when matching data corresponding to that non-terminal. The synthesized attributes of a non-terminal, on the other hand, specify information extracted from the matched data when matching the non-terminal, and constitute the value of the non-terminal.
A non-terminal is defined in terms of a sequence of production
rules. The rules in this sequence are separated by a ;
, which
acts as a top-level ordered choice combinator. When trying to match
a non-terminal at a parsing location, each rule is tried in order, and
the first rule that succeeds is used to compute the value of the
non-terminal (i.e. the synthesized attributes for a standard
non-terminal).
A non-terminal defined with production rules that are only composed of regular expressions is called a regular expression non-terminal. Unlike standard non-terminals, such a non-terminal does not have any synthesized attributes; its type is a byte list as opposed to a record.
A standard Parsley non-terminal is given a name beginning with an
uppercase letter and is declared to have a set of synthesized
attributes, each with an explicitly specified type. A :
separates
the attribute name from its type, and these name-type pairs are listed
within {
and }
, and separated by ,
. The production rules for
the non-terminal are specified on the right-hand side of a :=
separator.
NT {s1: i32, s2: bool} := ...
A value of such a non-terminal has a record type whose fields
correspond to these synthesized attributes. In fact, a record type
can be used as an abbreviation for the declaration of the synthesized
attributes. For example, the declaration of NT
above is equivalent
to the one below:
type nt = {s1: i32, s2: bool} format { NT {nt} := ... }
A non-terminal with an empty set of attributes has a unit type.
Unit {} := ...
In this case, the {
and }
can be omitted.
Note
|
As with user-defined types, attributes cannot contain function types. |
A non-terminal could have inherited attributes. These are specified
in name-type pairs similar to synthesized attributes, except that they
are enclosed in (
and )
(i.e. in parentheses instead of braces).
NTI (i1: bool, i2: i32) {is1: i32, is2: bool} :=
Inherited attributes, if present, need to be specified before any synthesized attributes. The declaration of inherited attributes cannot be abbreviated with a record type, unlike synthesized attributes.
A non-terminal being defined can be given a short name. Indeed, naming a non-terminal is required to assign computed values to the synthesized attributes within the right-hand side of the non-terminal definition.
The non-terminal being defined can be given a short name, which can used in the right-hand side of the definition for the assignment of the computed values of the synthesized attributes.
NT n {s1: i32, s2: bool} := ... {n.s1 := ...; n.s2 := ...} ...
Such a name should start with a lowercase letter, and needs to appear immediately after the non-terminal name.
Inherited attributes are accessed directly on the right-hand side of the definition:
NTI n (i1: bool, i2: i32) {is1: i32, is2: bool} := ... {n.is1 := i2; n.is2 := i1} ...
We now turn our focus to the right-hand side of the :=
of a
non-terminal definition, which consists of a sequence of production
rules separated by the ;
ordered choice rule combinator.
Each production rule is composed of a sequence of rule elements. A rule matches successfully if each element of the rule matches the input in order.
There is occasionally a need to keep track of some state during the
matching progress of a production rule. This state may be required
only for certain rules, and is typically discarded at the end of the
rule matching. This use case is supported through temporaries,
which are variables declared and initialized at the beginning of the
rule. These are placed within (|
and |)
markers before the first
rule element. These variables are within scope for each rule element
in the rule.
... := (| s: set<i32> := Set.empty, i: i32 := 0_i32 |) ...
Each rule element can optionally be named (with an identifier that begins with a lowercase), allowing the value matched by that rule element to be used within later rule elements that are within scope.
... := ... (<a not in scope> a=RE ... <scope of a> ...) <a not in scope>
In the above example, a
is bound to some rule element RE
within a
rule element subsequence (grouped within (
and )
). a
is not in
scope before its binding in the subsequence, and also not in scope
outside the subsequence. The type of a
corresponds to the type of
the element matched by RE
.
Parsley provides some primitive rule elements, and combinators to combine them into more complex elements. The primitive rule elements are: regular expressions, non-terminals, constraints, and actions.
Regular expressions are supported using special syntax that reduces
the overhead of the attribute grammar machinery. This special syntax
needs to be bounded within the (
and )
markers. Literal
strings are a common case of regular expression, and can be used directly
as a rule element by quoting the string within a matching pair of !
markers.
... := a=(# ... #) ... !"some string"!
The special syntax for regular expressions is further described below.
A regular expression matches a list of bytes, and a variable (such as
a
above) bound to it will have the corresponding type.
A non-terminal used as a rule element matches the sequence of bytes
corresponding to its definition as an ordered choice of production
rules. A variable bound to a non-terminal has the type given to the
non-terminal; for a standard non-terminal, this will be the record
type corresponding to its synthesized attributes (or unit if it has no
synthesized attributes). If the non-terminal has inherited
attributes, their values need to be specified within a pair of <
and
>
. The value specifying expressions for an inherited attributed can
be expressed in terms of all the attributes and bound rule elements in
scope at that point.
The example below illustrates the use, as rule elements, of the NT
and NTI
non-terminals discussed above. Both n
and ni
will have a
record type, corresponding to the types of NT
and NTI
respectively.
... := ... n=NT ... ni=NTI<i1=bool::False(), i2=3> ...
Constraints are boolean-valued expressions placed within [
and ]
markers. These expressions are expressed in terms of all the
attributes and bound rule elements in scope at that point. The
constraint rule element does not match any input, but instead controls
the matching progress of the rule element subsequence it occurs in.
If it evaluates to true, the matching continues to the next rule
element in the subsequence. Otherwise, it unwinds to the innermost
ordered choice combinator, and starts with the rule or rule element in
the choice.
Note
|
It is not very useful to bind a variable to a constraint rule element. Its value will always be true in its scope, since the rule elements in its scope will not be executed if it evaluates to false. |
Constraints are similar to but differ from dynamic assertions, described below.
Actions are blocks, demarcated by enclosing {
and }
markers, within
which some computation is done. This computation typically does one
or more of: assign values to the synthesized attributes of the
non-terminal being defined, update a temporary, or return a
value to the variable binding the action block. Action blocks do not
match any input. An action consists of a (possibly empty) sequence of
assignment statements separated by ;
. If a value needs to be
returned to a binding for the block, the expression computing the
value should be specified after a ;;
marker placed after the last
statement (if any). If there is no such expression, an action binder
gets a value of the unit
type.
Assignment statements are of the form lvalue := rvalue
, where
rvalue
is an expression. lvalue
is typically a synthesized
attribute or a temporary. For convenience, these assignment
sequences can be placed in the bodies of let
bindings or the branch
bodies of case
discriminators.
Act a {v: i32, w: i32} := b=[bool::True()] { let c = b in (case c of | bool::True() -> {a.v := 0_i32; a.w := 1_i32} | bool::False() -> {a.v := 1_i32; a.w := 0_i32} ) }
To aid debugging using the interpreter, an additional statement form
is the print statement, whose syntax is $print(e)
where e
is any
expression. When this statement is executed in the interpreter, the
value that e
evaluates to is printed to the standard error. A
sequence of bytes in the value is interpreted as binary data and
printed in hex. However, in some cases, sequences of bytes could
contain ASCII characters and hence may need to be printed as strings;
this use case is served by $print_t(e)
.
Dynamic assertions are like constraints in that they are required to be satisfied in order for the parse to continue. However, a dynamic assertion is unlike a constraint in the following ways:
-
It is placed within
%[
and]%
syntactic markers. -
It is not an arbitrary boolean expression; instead, it needs to be one of a set of syntactically restricted expressions, described below.
-
If the condition in the assertion does not hold, the execution of the parse is suspended, and control is returned to the application. The application is provided a reason for the suspension, and can decide to change the environment in order to make the assertion true dynamically. The application can return control to the parser in order resume the paused execution. The parser resumes execution at the point of the pause, after confirming that the asserted condition holds.
Currently, the only supported dynamic assertion is
require_remaining
, specified as:
... := ... %[ require_remaining(v, e) ]% ...
This asserts that the view in v
should have at least e
number of
bytes beyond the current offset, where e
is an expression of type
usize
. Note that v
is required to be an open view, i.e. a
view that can be extended by adding data to the end of the backing
buffer. If v
is such a view, and the condition does not hold,
control returns to the application, which can append more data to the
end of the view (e.g. by reading more data from a network socket or a
pipe) in order to satisfy the condition. The application can then
return control to Parsley, which in turn checks the condition and
resumes the parse immediately after this rule element.
Note
|
If v is not an open view, the parse terminates immediately
with an appropriate error, as this is most likely a bug in the Parsley
specification.
|
As mentioned above, these primitive rule elements can be combined into more complex elements through combinators. We’ve already come across the simplest combinator, which is sequence, represented as whitespace between rule elements. A sequence of rule elements is matched by matching each rule element in order; if any element fails, the sequence fails. If a variable is bound to a sequence, its value is a tuple containing the values of each rule element, and hence its type is a tuple of the corresponding types. However, in the special case when all the rule elements are formed from regular expressions, the sequence is itself considered a regular expression, and its value is a flattened byte list formed from the values of the individual expressions in the sequence.
For convenience, rule elements can be grouped between (
and )
when
composing with combinators.
The ordered choice combinator is denoted by an infix |
between the
rule elements comprising the choice. The value matched by this
combinator is the value matched by the first successfully matching
rule element in the choice; the combinator fails if every element
fails to match. If a variable is bound to this combinator, each
element in the choice is constrained to have the same type; if there
is no binding, there is no such constraint. Again, regular
expressions form a special case: if every element in the choice is a
regular expression, the combinator itself is treated as a regular
expression.
Note
|
The | is an ordered choice combinator for rule elements, while
; is an ordered choice combinator for rules. The difference between
the two is only for specification and syntactic convenience, and they
function the same in terms of the dynamic matching semantics.
|
Note
|
The ... := B C | D E is parsed as ... := (B C) | (D E) |
The unbounded repeat combinator is denoted by a postfix *
applied
to a rule element. The combinator matches zero or more matches of the
element, and hence can never fail. If the type of the match returned
by the element is t
, the type of the value matched by the combinator
is the list type [t]
. However, if the element is a regular
expression, the combinator itself is considered a regular expression,
and the value matched by the combinator is a flattened list of the
matches of the element.
The bounded repeat combinator is denoted by the caret ^
, placed
infix between a rule element and an integer-valued bound expression.
If the bound expression evaluates to n
during matching, this
combinator matches exactly n
consecutive matches of the element,
otherwise it fails. As in the case of the unbounded repeat, the type
of this combinator is [t]
, where t
is the type of the element,
except in the case of a regular expression element, in which case the
type is a byte list, and the value is a flattened list of the n
matches.
The option combinator is denoted by a postfix ?
applied to a rule
element. The combinator matches at most one match of the element, and
hence can never fail. If the type of the match returned by the
element is t
, the type of the value matched by the combinator is
option<t>
. If the element is a regular expression, however, the
combinator itself is a regular expression, and its value is an empty
list on failure.
The matching performed by a rule element can be restricted to a view
using the view-mapping combinators. By default, all parsing is
performed at the current cursor location within an implicit current
view. View.get_current
returns a copy of this current view, and
View.get_current_cursor
provides the integer value (of type usize
)
of the current cursor in the current view. View.get_base
provides a
copy of the current view with the cursor location set to the start of
the buffer. View.make_current
takes a view and makes it the current
view.
The at-position combinator @(p, RE)
operates on an integer
expression p
and a rule element RE
. This combinator creates a new
view, say v
, from the current view, with its cursor pointing at the
integer offset (of type usize
) given by the evaluation of p
. The
value returned by the combinator is the value matched by RE
in v
.
Hence, the type of this combinator application is the same as the type
of RE
. The combinator fails if RE
fails to match. After the
combinator is processed, whether to success or failure, the view v
created for the combinator is discarded, and the current view active
before the combinator was processed is unmodified and remains current.
Note
|
Within the processing of RE , a call to View.get_current will
typically return a view derived from v .
|
Caution
|
If p generates an offset that is out of bounds of the
current view, an exception is thrown as described in
Exceptions.
|
The at-view combinator @[v, RE]
operates on a view v
and a rule
element RE
. It operates similarly to the at-position combinator,
except that no new view is created and v
is directly used.
The map-views combinator @#[views, RE]
operates on a list of views
given by views
. Conceptually, it is a map of @[v, RE]
for each
v
in views
, and returns a list of the corresponding matches.
Hence, if a match of RE
has type t
, then the type of the match
returned by this combinator is [t]
. The map-views combinator fails
if RE
fails at any view in views
.
The side-effects of parsing on the views used in a view-mapping combinator are not visible outside the combinator.
T t {i: usize, j: usize} := Byte v={;; View.restrict(View.get_current(), 0u, 4u)} @[v, UInt32<endian::endian::Little()>] { t.i := View.get_current(v); t.j := View.get_remaining(v) }
In the above example, t.i
and t.j
will be 0u
and 4u
respectively. If the side-effects had been visible, the values would
instead be 4u
and 0u
.
When a map-view combinator is used with a non-terminal with inherited
attributes, it is often convenient to provide different values for one
or more inherited attributes for the different views in the map. This
can be done using the ←
assignment operator. The right hand side
of the ←
operator should be an expression in parentheses that
evaluates to a list of values that is of the same length as the number
of views in the map. For each view in the map, the corresponding
value in the list is provided to the inherited attribute.
NT n (i: i32, j: usize) {n: unit} := ... ;; T t (vs: [view], is: [i32]) {t: unit} := @#[vs, NT<i <- (is), j = 0u>] ...
In the fragment above, when NT
is mapped over the list of views vs
in the production rule for T
, its i
attribute is provided a value
in is
at the index corresponding to the index of the view in vs
in
which NT
is being applied.
Note
|
The ← assignment operator can only be used with a
non-terminal directly under a map-views combinator; i.e. it can only
be used in @#[views, RE] when RE is a non-terminal.
|
Note
|
An internal constraint is generated to ensure that any lists
used with the NT n (i: i32, j: usize) {n: unit} := ... ;; T t (vs: [view], is: [i32]) {t: unit} := [List.length(vs) = List.length(is)] @#[vs, NT<i <- (is), j = 0u>] ... The insertion of an internal constraint may result in unexpected types for named rule elements when complex constructs are bound. To avoid this, this construct should follow the most common use-case for binding names to rule elements, where the result of the map-view is bound directly. For example, the below named binding may result in the type of T t (vs: [view], is: [i32]) {t: unit} := a=(Byte @#[vs, NT<i <- (is), j = 0u>] ) ... To avoid this, named bindings should be as specific as possible, and
should bind the map-view directly. The naming below will result in
T t (vs: [view], is: [i32]) {t: unit} := a=Byte b=@#[vs, NT<i <- (is), j = 0u>] ... |
The set-view operator @^[v]
sets the current view to v
.
Subsequent matching will be performed in this view until the current
view is altered.
A regular expression rule element can use a special syntax within a
block enclosed by (
and )
to form a composite expression using
regular expression combinators. The primitive expressions are literal
sets, wildcards, and non-terminals, and can be composed using the
usual unbounded repeat (*
), bounded repeat (^
), option (?
),
choice (|
) and sequence (` `) combinators. These combinators are
denoted by the same symbol used by the corresponding rule element
combinator. As usual, `(` and )
can be used for grouping purposes.
Note
|
In regular expressions, bounds for the repeat combinator (^ )
need to be constant expressions. In addition, the choice (| )
operator implements unordered or unbiased choice, unlike the choice
operator in the attribute grammar.
|
A literal set is enclosed within [
and ]
markers, and is
composed from individual literal ASCII strings and other literal sets,
using the choice, difference, and range operators. Within a
literal, conventional escapes can be used to denote characters for
linefeed (\n
), carriage-return (\r
), and horizontal tab (\t
).
Numerical escape-codes are also supported in the form of decimal codes
(\ddd
, where d
is a decimal), hexadecimal codes (\xhh
where h
is hexadecimal), and octal codes (\occc
where c is octal).
Note
|
Currently, the " double-quote character can only be specified
with a numerical escape code (e.g. \034 , \x22 , or \o042 ).
|
The choice operator, denoted with an infix |
, matches one of a
given set of literal strings.
["A" | "B" | "C"]
A character class is a set of related literal characters that is
named for convenience. The character classes available in the
standard library are AsciiCharS
, DigitS
, HexCharS
and
AlphaNumS
.
It is often convenient to use all but a few of the characters from a
character class. This can be done using the difference operator,
denoted with an infix \
. It denotes the difference between the
character class specified as the left operand and the literal set
given as the right operand. For example, the below literal set
matches any ASCII character except (
.
[AsciiCharS \ "("]
The range operator, denoted with an infix ..
, matches any string
covered by the range between the left literal set and the right
literal set.
["10" .. "99"]
Note
|
There are restrictions on the kinds of arguments taken by these operators to ensure that the resulting literal set can be computed and matched efficiently. |
The wildcard regular expression, denoted , matches any single
byte. The matched value returned by the wildcard will be a single
element byte list containing the matched byte. (The
symbol is
used instead of the more common
.
, since the .
character is
sometimes hard to see in an editor, and it conflicts with the .
notation used in the module system for the grammar sublanguage.)
A regular expression non-terminal can be used as a regular expression element composable with other regular expressions using the regular expression combinators.
Although scanning for specific tag bytes can be expressed as a regular
expression, the regular expression can be complicated and hence
inconvenient to write by hand. There is explicit support for such use
cases using the forward scan (denoted with /sf
) and backward
scan (denoted with /sb
) operators. A forward scan for the bytes
tag
would be written as /sf["tag"]
; similarly, the corresponding
backward scan would be /sb["tag"]
. After a successful forward scan,
the cursor would be positioned on the last tag byte (g
in the case
of /sf["tag"]
) example), whereas after a successful backward scan,
it is on the first tag byte (t
in the case of /sb["tag"]
). The
bytes returned as the match correspond to all the bytes from the
starting cursor position to the final position, inclusive. Matched
bytes are always returned in forwards order, i.e. from the smallest
matching cursor position to the largest.
For example, if the view contains the bytes 123abcxyz
and the cursor
is positioned on the a
, /sf["xy"]
will return abcxy
as matched
bytes, leaving the cursor positioned on y
, while /sb["12"]
from
the same cursor position will return 123a
as matched bytes, leaving
the cursor on 1
.
As mentioned above, the definition of a non-terminal is given by a
sequence of rules composed by the ;
ordered-choice operator, where
each rule is composed of rule elements. The key characteristic of a
rule is that it is a self-contained specification of one alternative
way of parsing that non-terminal. The self-contained property means
that the values of all of the non-terminal’s synthesized attributes
have been computed and assigned by the end of the rule. Synthesized
attributes are not implicitly initialized to default values; they
need to be explicitly initialized if necessary.
Synthesized attributes are assigned values in action blocks, which can
be composed with other rule elements using combinators. It is useful
to keep in mind how the properties of some combinators interact with
action blocks containing such attribute assignments. In particular,
some combinators like the unbounded repeat *
could succeed without
executing the rule elements in its immediate scope (i.e. no other
combinator under the *
has scope over the elements). For example,
the specification below has two errors:
ByteVec bv {v: [byte]} := (b=Byte { bv.v := b :: bv.v })*
The first error is that if the sequence under the *
is executed the
first time, b
is prefixed to the uninitialized attribute bv.v
. The
second error is that it is possible that the production rule succeeds
without matching any byte (e.g. when it is at the end of input). In
that case, the v
attribute would remain unassigned at the end of the
rule. A specification that assigned a default value to v
would fix
both these errors. This assignment can be done either in the
attribute declaration itself, or with an initial action block.
ByteVec bv {v: [byte] := []} := (b=Byte { bv.v := b :: bv.v })* ByteVec bv {v: [byte]} := { bv.v := [] } (b=Byte { bv.v := b :: bv.v })*
The ?
option combinator also has the property that it could succeed
without matching any input, resulting in any attribute assigments in
action blocks under its scope not being executed after a successful
match. A similar consideration applies to the |
ordered choice
operator, if attribute assignments are done in some of the choices but
not others.
Unlike the unbounded repeat *
, the bounded repeat combinator ^
should execute the rule elements in its immediate scope if the bound
is greater than 0
. Parsley is able to take this into account when
the bound is a constant non-zero expression. However, when the bound
is computed from other entities in the context, Parsley takes a
conservative view and assumes that the bound could be zero, i.e. that
any action elements in its immediate scope may not execute.
In the case of the map-views combinator @#[views, RE]
, any action
elements in RE
do not execute if views
is an empty list. Parsley
currently cannot statically determine if the list is always non-empty,
and so also takes a conservative view: it assumes that views
could
be empty, and that any action blocks in RE
may not execute.
Bits can be matched within production rules; however, all non-bit-oriented constructs need to be byte-aligned. An alignment specifier can be used to restore this alignment.
A bitvector of length n
can be matched using the non-terminal
BitVector<n>
. Bits in bitvectors are indexed from 0, with 0 mapped
to the least-significant bit. So, for example, the bits in a
BitVector<8>
that matches 0x0F
are arranged from 7:0
, with bits
in 7:4
matching 0b0000
and those in 3:0
matching 0b1111
.
A user-defined bitfield bf
covering n
bits can be matched using
the specifier $bitfield(bf)
, which converts the n
matched bits
into a value of the bf
bitfield type.
The $align<n>$
operator can be used to skip over bits to the next
byte boundary lying at a multiple of n
bits, where n
must be a
multiple of 8. For example, the sequence BitVector<1> $align<64>
,
when started on a byte boundary, matches 1 bit and then skips over 63
bits to end at a cursor position 64 bits from the starting point.
A sequence of bits can be matched using a sequence of BitVectors and bitfield specifiers. However, this sequence needs to start and end on a byte boundary. Parsley requires that within a rule, all regular expressions, non-terminals, and invocations of choice, star, option, and view-mapping combinators need to occur at byte-aligned positions in the parsing buffer.
If a failure is experienced while parsing a sequence of bits, the next parsing choice continues at the byte-aligned location at which the bit matching started.
An example of the use of the bitvector and bitfield support is shown
below. The rule for non-terminal N
matches 24 bits, with the first
byte being ignored, the next 7 bits stored in attribute v
, and the
next 9 bits converted into a bitfield bf
and stored in attribute
f
.
bitfield bf = { top: 8:7, bot: 6:0 } format { N n {v: bitvector<7>, f: bf} := BitVector<1> $align<8> v=BitVector<7> f=$bitfield(bf) { n.v := v; n.f := f } }
If $bitfield(bf)
matches 0b100000011
above, f.top
will be 0b10
and f.bot
will be 0b0000011
.
Endianness concerns arise when using bitvectors that are longer than 8 bits or that straddle byte boundaries. The bytes containing the bits being parsed are traversed in a left-to-right manner, i.e. in the direction of increasing byte offset into the parse buffer. This implies that the bytes in a long bitvector appear in a big-endian order. This needs to be accounted for when converting bitvectors into integers.
For example, if the non-terminal Bits
below is used to parse the
four bytes 0x11000102
, the values of b.v1
, b.v2
and b.v3
will be 0_u8
, 17_u8
and 258_u32
respectively.
format { Bits b {v1: u8, v2: u8, v3: u32} := v1=BitVector<3> v2=BitVector<5> v3=BitVector<24> // matches 4-bytes { b.v1 := Bits.to_u8(v1); b.v2 := Bits.to_u8(v2); b.v3 := Bits.to_u32(v3) } }
The Bits
module contains integer conversion routines that take an
endianness specification, such as to_u16_endian
,
to_i32_endian
, etc. The first or leftmost eight bits in a
little-endian bitvector are considered to form the least-significant
byte, whereas for a big-endian bitvector, the last or rightmost eight
bits are considered to form the least-significant byte.
A little-endian conversion of the 24-bit v3
can be computed using
Bits.to_u32_endian
, resulting in b.v3
taking the value 131328_u32
.
b.v3 := Bits.to_u32_endian(endian::Little(), v3)
As the example above shows, the default integer conversion of a parsed bitvector treats the bytes within it in big-endian order.
Parsley provides an annotation scheme that can be used to provide auxiliary information for non-terminals. Each non-terminal can be annotated via a decorator, using the syntax below:
#[deco_type1(deco_val1:key1=val1,key2,key3=val3);deco_type2(deco_val2: ...)] NT ... := ...
The decorator is specified immediately before the non-terminal it
applies to, and consists of a sequence of decorator specifiers
separated by ;
. Each specifier has a decorator type and its
decorator value, with the value enclosed within a pair of
parentheses (
)
The value can be optionally be parameterized by a
list of keys with optional values that are separated by ,
. If this
parameter list is present, it is separated from the value by a :
,
and a parameter key
with a value
is denoted simply with
key=value
.
All the entities in the decorator are simple identifiers; i.e. expression syntax cannot be used as types, keys or values.
A decorator cannot have two specifiers with the same decorator type, and a decorator value cannot have two keys with the same name.
Parsley only interprets decorators that it supports; other decorators are silently ignored.
Note: The decorator "type" identifier is unrelated any identifier in the Parsley type system.
Unlike parsers for programming languages that are designed to be readable by humans, Parsley is normally used for specify data formats that are often optimized for machine processing rather than human readability. As a result, it does not have any implicit notion of whitespace. However, one often needs to specify data formats that can have fragments that use a notion of whitespace, and the grammar elements in these formats can be tedious to specify when every occurence of whitespace needs to be explicitly specified.
To help with this use case, Parsley supports annotating a non-terminal
(say NT
) with a whitespace
decorator that names another
non-terminal (say WS
) that can be considered to separate the rule
elements in each rule of NT
. The goal here is to recover the
convenience of an implicit notion of whitespace when specifying the
rules of the non-terminal NT
, and making explicit this notion of
whitespace just once in the decorator.
As an example, suppose the following definition was used to define the
WS
non-terminal to match possibly empty whitespace:
WS (allow_empty: bool) := [allow_empty] (# [" " | "\t" | "\r" | "\n"]* #) ; [!allow_empty] (# [" " | "\t" | "\r" | "\n"]+ #) ;;
The inherited attribute allow_empty
controls whether WS
can
match empty whitespace.
Then it can be used to specify the whitespace separating the rule
elements for a non-terminal NT
as follows:
#[whitespace(WS:allow_empty=true)] NT := A B ; ((A C) | (B D))* ;;
This is equivalent to the following definition of NT
without a
whitespace decorator, which is quite verbose.
NT := // first rule after decoration WS<allow_empty=bool::True()> A WS<allow_empty=bool::True()> B WS<allow_empty=bool::True()> // second rule after decoration ; WS<allow_empty=bool::True()> ( (A WS<allow_empty=bool::True()> C WS<allow_empty=bool::True>()) | (B WS<allow_empty=bool::True()> D WS<allow_empty=bool::True>()) )* WS<allow_empty = bool::True> ;;
Parsley uses some heuristics to insert the specified whitespace non-terminal into the decorated rules, which in most cases should give the same result as a parser with an implicit notion of whitespace.
Caution
|
There are some situations where the decorator should only be used with considerable caution, if at all:
In these situations, it is likely that the automatically inserted whitespace might violate assumptions about byte-alignment or cursor locations in views and constraints, or about matching optional whitespace. |
Parsley uses some heuristics when inserting whitespace in such rules. For example, it does not introduce whitespace immediately before constraints or any view mapping combinator, and before or after any construct dealing with bitvectors.
Note
|
One can use the -dd option to print the rules of a
non-terminal after whitespace decoration to ensure they align with
expectations.
|
Parsley supports the use of definitions from one Parsley specification, considered as a module, within another Parsley specification; this is described further in the expression language documentation.
As described there, the non-terminal used as a rule element in a
production rule for a non-terminal definition can refer to a
non-terminal defined in another module: M.NT
refers to the
non-terminal NT
defined in module M
.
Unlike non-terminals, Parsley does not support cross-module use of definitions for regular expressions and literals. A specification can however refer to regular expressions and literals defined in the Parsley standard library.