Skip to content

Latest commit

 

History

History
845 lines (727 loc) · 36.2 KB

grammar.adoc

File metadata and controls

845 lines (727 loc) · 36.2 KB

Grammar sub-language

The contents of a format block in a Parsley specification are written in the Parsley grammar sub-language. This block contains a sequence of definitions of the non-terminals in the grammar. Each definition is separated from the subsequent definition by a ;; marker. The last definition in the sequence should be terminated by the terminating } of the format block.

An attribute system is used to annotate the non-terminals in the grammar with information that is semantically relevant in the format. The inherited attributes of a non-terminal specify the contextual information that can be used when matching data corresponding to that non-terminal. The synthesized attributes of a non-terminal, on the other hand, specify information extracted from the matched data when matching the non-terminal, and constitute the value of the non-terminal.

A non-terminal is defined in terms of a sequence of production rules. The rules in this sequence are separated by a ;, which acts as a top-level ordered choice combinator. When trying to match a non-terminal at a parsing location, each rule is tried in order, and the first rule that succeeds is used to compute the value of the non-terminal (i.e. the synthesized attributes for a standard non-terminal).

A non-terminal defined with production rules that are only composed of regular expressions is called a regular expression non-terminal. Unlike standard non-terminals, such a non-terminal does not have any synthesized attributes; its type is a byte list as opposed to a record.

Defining Non-terminals

A standard Parsley non-terminal is given a name beginning with an uppercase letter and is declared to have a set of synthesized attributes, each with an explicitly specified type. A : separates the attribute name from its type, and these name-type pairs are listed within { and }, and separated by ,. The production rules for the non-terminal are specified on the right-hand side of a := separator.

 NT {s1: i32, s2: bool} := ...

A value of such a non-terminal has a record type whose fields correspond to these synthesized attributes. In fact, a record type can be used as an abbreviation for the declaration of the synthesized attributes. For example, the declaration of NT above is equivalent to the one below:

 type nt = {s1: i32, s2: bool}

 format {
   NT {nt} := ...
 }

A non-terminal with an empty set of attributes has a unit type.

 Unit {} := ...

In this case, the { and } can be omitted.

Note
As with user-defined types, attributes cannot contain function types.

A non-terminal could have inherited attributes. These are specified in name-type pairs similar to synthesized attributes, except that they are enclosed in ( and ) (i.e. in parentheses instead of braces).

 NTI (i1: bool, i2: i32) {is1: i32, is2: bool} :=

Inherited attributes, if present, need to be specified before any synthesized attributes. The declaration of inherited attributes cannot be abbreviated with a record type, unlike synthesized attributes.

A non-terminal being defined can be given a short name. Indeed, naming a non-terminal is required to assign computed values to the synthesized attributes within the right-hand side of the non-terminal definition.

The non-terminal being defined can be given a short name, which can used in the right-hand side of the definition for the assignment of the computed values of the synthesized attributes.

 NT n {s1: i32, s2: bool} := ... {n.s1 := ...; n.s2 := ...} ...

Such a name should start with a lowercase letter, and needs to appear immediately after the non-terminal name.

Inherited attributes are accessed directly on the right-hand side of the definition:

 NTI n (i1: bool, i2: i32) {is1: i32, is2: bool} := ... {n.is1 := i2; n.is2 := i1} ...

Rules and Rule Elements

We now turn our focus to the right-hand side of the := of a non-terminal definition, which consists of a sequence of production rules separated by the ; ordered choice rule combinator.

Each production rule is composed of a sequence of rule elements. A rule matches successfully if each element of the rule matches the input in order.

There is occasionally a need to keep track of some state during the matching progress of a production rule. This state may be required only for certain rules, and is typically discarded at the end of the rule matching. This use case is supported through temporaries, which are variables declared and initialized at the beginning of the rule. These are placed within (| and |) markers before the first rule element. These variables are within scope for each rule element in the rule.

... := (| s: set<i32> := Set.empty, i: i32 := 0_i32 |) ...

Each rule element can optionally be named (with an identifier that begins with a lowercase), allowing the value matched by that rule element to be used within later rule elements that are within scope.

... := ... (<a not in scope> a=RE ... <scope of a> ...) <a not in scope>

In the above example, a is bound to some rule element RE within a rule element subsequence (grouped within ( and )). a is not in scope before its binding in the subsequence, and also not in scope outside the subsequence. The type of a corresponds to the type of the element matched by RE.

Parsley provides some primitive rule elements, and combinators to combine them into more complex elements. The primitive rule elements are: regular expressions, non-terminals, constraints, and actions.

Regular expressions are supported using special syntax that reduces the overhead of the attribute grammar machinery. This special syntax needs to be bounded within the ( and ) markers. Literal strings are a common case of regular expression, and can be used directly as a rule element by quoting the string within a matching pair of ! markers.

... := a=(# ... #) ... !"some string"!

The special syntax for regular expressions is further described below. A regular expression matches a list of bytes, and a variable (such as a above) bound to it will have the corresponding type.

A non-terminal used as a rule element matches the sequence of bytes corresponding to its definition as an ordered choice of production rules. A variable bound to a non-terminal has the type given to the non-terminal; for a standard non-terminal, this will be the record type corresponding to its synthesized attributes (or unit if it has no synthesized attributes). If the non-terminal has inherited attributes, their values need to be specified within a pair of < and >. The value specifying expressions for an inherited attributed can be expressed in terms of all the attributes and bound rule elements in scope at that point.

The example below illustrates the use, as rule elements, of the NT and NTI non-terminals discussed above. Both n and ni will have a record type, corresponding to the types of NT and NTI respectively.

... := ... n=NT ... ni=NTI<i1=bool::False(), i2=3> ...

Constraints are boolean-valued expressions placed within [ and ] markers. These expressions are expressed in terms of all the attributes and bound rule elements in scope at that point. The constraint rule element does not match any input, but instead controls the matching progress of the rule element subsequence it occurs in. If it evaluates to true, the matching continues to the next rule element in the subsequence. Otherwise, it unwinds to the innermost ordered choice combinator, and starts with the rule or rule element in the choice.

Note
It is not very useful to bind a variable to a constraint rule element. Its value will always be true in its scope, since the rule elements in its scope will not be executed if it evaluates to false.

Constraints are similar to but differ from dynamic assertions, described below.

Actions are blocks, demarcated by enclosing { and } markers, within which some computation is done. This computation typically does one or more of: assign values to the synthesized attributes of the non-terminal being defined, update a temporary, or return a value to the variable binding the action block. Action blocks do not match any input. An action consists of a (possibly empty) sequence of assignment statements separated by ;. If a value needs to be returned to a binding for the block, the expression computing the value should be specified after a ;; marker placed after the last statement (if any). If there is no such expression, an action binder gets a value of the unit type.

Assignment statements are of the form lvalue := rvalue, where rvalue is an expression. lvalue is typically a synthesized attribute or a temporary. For convenience, these assignment sequences can be placed in the bodies of let bindings or the branch bodies of case discriminators.

  Act a {v: i32, w: i32} :=
    b=[bool::True()]
    { let c = b in
      (case c of
      | bool::True()  -> {a.v := 0_i32; a.w := 1_i32}
      | bool::False() -> {a.v := 1_i32; a.w := 0_i32}
      )
    }
Printing

To aid debugging using the interpreter, an additional statement form is the print statement, whose syntax is $print(e) where e is any expression. When this statement is executed in the interpreter, the value that e evaluates to is printed to the standard error. A sequence of bytes in the value is interpreted as binary data and printed in hex. However, in some cases, sequences of bytes could contain ASCII characters and hence may need to be printed as strings; this use case is served by $print_t(e).

Dynamic assertions

Dynamic assertions are like constraints in that they are required to be satisfied in order for the parse to continue. However, a dynamic assertion is unlike a constraint in the following ways:

  • It is placed within %[ and ]% syntactic markers.

  • It is not an arbitrary boolean expression; instead, it needs to be one of a set of syntactically restricted expressions, described below.

  • If the condition in the assertion does not hold, the execution of the parse is suspended, and control is returned to the application. The application is provided a reason for the suspension, and can decide to change the environment in order to make the assertion true dynamically. The application can return control to the parser in order resume the paused execution. The parser resumes execution at the point of the pause, after confirming that the asserted condition holds.

Currently, the only supported dynamic assertion is require_remaining, specified as:

... := ... %[ require_remaining(v, e) ]% ...

This asserts that the view in v should have at least e number of bytes beyond the current offset, where e is an expression of type usize. Note that v is required to be an open view, i.e. a view that can be extended by adding data to the end of the backing buffer. If v is such a view, and the condition does not hold, control returns to the application, which can append more data to the end of the view (e.g. by reading more data from a network socket or a pipe) in order to satisfy the condition. The application can then return control to Parsley, which in turn checks the condition and resumes the parse immediately after this rule element.

Note
If v is not an open view, the parse terminates immediately with an appropriate error, as this is most likely a bug in the Parsley specification.

Combinators

As mentioned above, these primitive rule elements can be combined into more complex elements through combinators. We’ve already come across the simplest combinator, which is sequence, represented as whitespace between rule elements. A sequence of rule elements is matched by matching each rule element in order; if any element fails, the sequence fails. If a variable is bound to a sequence, its value is a tuple containing the values of each rule element, and hence its type is a tuple of the corresponding types. However, in the special case when all the rule elements are formed from regular expressions, the sequence is itself considered a regular expression, and its value is a flattened byte list formed from the values of the individual expressions in the sequence.

For convenience, rule elements can be grouped between ( and ) when composing with combinators.

The ordered choice combinator is denoted by an infix | between the rule elements comprising the choice. The value matched by this combinator is the value matched by the first successfully matching rule element in the choice; the combinator fails if every element fails to match. If a variable is bound to this combinator, each element in the choice is constrained to have the same type; if there is no binding, there is no such constraint. Again, regular expressions form a special case: if every element in the choice is a regular expression, the combinator itself is treated as a regular expression.

Note
The | is an ordered choice combinator for rule elements, while ; is an ordered choice combinator for rules. The difference between the two is only for specification and syntactic convenience, and they function the same in terms of the dynamic matching semantics.
Note

The | ordered choice binds tighter than the sequence operator. Hence

... := B C | D E

is parsed as B (C | D) E. Grouping should be used to indicate choice between sequences:

... := (B C) | (D E)

The unbounded repeat combinator is denoted by a postfix * applied to a rule element. The combinator matches zero or more matches of the element, and hence can never fail. If the type of the match returned by the element is t, the type of the value matched by the combinator is the list type [t]. However, if the element is a regular expression, the combinator itself is considered a regular expression, and the value matched by the combinator is a flattened list of the matches of the element.

The bounded repeat combinator is denoted by the caret ^, placed infix between a rule element and an integer-valued bound expression. If the bound expression evaluates to n during matching, this combinator matches exactly n consecutive matches of the element, otherwise it fails. As in the case of the unbounded repeat, the type of this combinator is [t], where t is the type of the element, except in the case of a regular expression element, in which case the type is a byte list, and the value is a flattened list of the n matches.

The option combinator is denoted by a postfix ? applied to a rule element. The combinator matches at most one match of the element, and hence can never fail. If the type of the match returned by the element is t, the type of the value matched by the combinator is option<t>. If the element is a regular expression, however, the combinator itself is a regular expression, and its value is an empty list on failure.

View-mapping Combinators

The matching performed by a rule element can be restricted to a view using the view-mapping combinators. By default, all parsing is performed at the current cursor location within an implicit current view. View.get_current returns a copy of this current view, and View.get_current_cursor provides the integer value (of type usize) of the current cursor in the current view. View.get_base provides a copy of the current view with the cursor location set to the start of the buffer. View.make_current takes a view and makes it the current view.

The at-position combinator @(p, RE) operates on an integer expression p and a rule element RE. This combinator creates a new view, say v, from the current view, with its cursor pointing at the integer offset (of type usize) given by the evaluation of p. The value returned by the combinator is the value matched by RE in v. Hence, the type of this combinator application is the same as the type of RE. The combinator fails if RE fails to match. After the combinator is processed, whether to success or failure, the view v created for the combinator is discarded, and the current view active before the combinator was processed is unmodified and remains current.

Note
Within the processing of RE, a call to View.get_current will typically return a view derived from v.
Caution
If p generates an offset that is out of bounds of the current view, an exception is thrown as described in Exceptions.

The at-view combinator @[v, RE] operates on a view v and a rule element RE. It operates similarly to the at-position combinator, except that no new view is created and v is directly used.

The map-views combinator @#[views, RE] operates on a list of views given by views. Conceptually, it is a map of @[v, RE] for each v in views, and returns a list of the corresponding matches. Hence, if a match of RE has type t, then the type of the match returned by this combinator is [t]. The map-views combinator fails if RE fails at any view in views.

The side-effects of parsing on the views used in a view-mapping combinator are not visible outside the combinator.

  T t {i: usize, j: usize} :=
    Byte
    v={;; View.restrict(View.get_current(), 0u, 4u)}
    @[v, UInt32<endian::endian::Little()>]
    { t.i := View.get_current(v);
      t.j := View.get_remaining(v) }

In the above example, t.i and t.j will be 0u and 4u respectively. If the side-effects had been visible, the values would instead be 4u and 0u.

When a map-view combinator is used with a non-terminal with inherited attributes, it is often convenient to provide different values for one or more inherited attributes for the different views in the map. This can be done using the assignment operator. The right hand side of the operator should be an expression in parentheses that evaluates to a list of values that is of the same length as the number of views in the map. For each view in the map, the corresponding value in the list is provided to the inherited attribute.

  NT n (i: i32, j: usize) {n: unit} := ... ;;

  T t (vs: [view], is: [i32]) {t: unit} :=
      @#[vs, NT<i <- (is), j = 0u>]
      ...

In the fragment above, when NT is mapped over the list of views vs in the production rule for T, its i attribute is provided a value in is at the index corresponding to the index of the view in vs in which NT is being applied.

Note
The assignment operator can only be used with a non-terminal directly under a map-views combinator; i.e. it can only be used in @#[views, RE] when RE is a non-terminal.
Note

An internal constraint is generated to ensure that any lists used with the operator have the same length as the number of views in the map-view. In the example above, this internal constraint ensures that vs and is have the same lengths. That is, the above fragment is equivalent to:

  NT n (i: i32, j: usize) {n: unit} := ... ;;

  T t (vs: [view], is: [i32]) {t: unit} :=
      [List.length(vs) = List.length(is)]
      @#[vs, NT<i <- (is), j = 0u>]
      ...

The insertion of an internal constraint may result in unexpected types for named rule elements when complex constructs are bound. To avoid this, this construct should follow the most common use-case for binding names to rule elements, where the result of the map-view is bound directly.

For example, the below named binding may result in the type of a being a 3-tuple instead of a 2-tuple, due to the internal list length constraint being including in the a binding.

  T t (vs: [view], is: [i32]) {t: unit} :=
      a=(Byte
         @#[vs, NT<i <- (is), j = 0u>]
        )
      ...

To avoid this, named bindings should be as specific as possible, and should bind the map-view directly. The naming below will result in a and b getting their expected types, and will not be affected by the internal list-length constraint.

  T t (vs: [view], is: [i32]) {t: unit} :=
      a=Byte
      b=@#[vs, NT<i <- (is), j = 0u>]
      ...

The set-view operator @^[v] sets the current view to v. Subsequent matching will be performed in this view until the current view is altered.

Regular Expressions

A regular expression rule element can use a special syntax within a block enclosed by ( and ) to form a composite expression using regular expression combinators. The primitive expressions are literal sets, wildcards, and non-terminals, and can be composed using the usual unbounded repeat (*), bounded repeat (^), option (?), choice (|) and sequence (` `) combinators. These combinators are denoted by the same symbol used by the corresponding rule element combinator. As usual, `(` and ) can be used for grouping purposes.

Note
In regular expressions, bounds for the repeat combinator (^) need to be constant expressions. In addition, the choice (|) operator implements unordered or unbiased choice, unlike the choice operator in the attribute grammar.

A literal set is enclosed within [ and ] markers, and is composed from individual literal ASCII strings and other literal sets, using the choice, difference, and range operators. Within a literal, conventional escapes can be used to denote characters for linefeed (\n), carriage-return (\r), and horizontal tab (\t). Numerical escape-codes are also supported in the form of decimal codes (\ddd, where d is a decimal), hexadecimal codes (\xhh where h is hexadecimal), and octal codes (\occc where c is octal).

Note
Currently, the " double-quote character can only be specified with a numerical escape code (e.g. \034, \x22, or \o042).

The choice operator, denoted with an infix |, matches one of a given set of literal strings.

  ["A" | "B" | "C"]

A character class is a set of related literal characters that is named for convenience. The character classes available in the standard library are AsciiCharS, DigitS, HexCharS and AlphaNumS.

It is often convenient to use all but a few of the characters from a character class. This can be done using the difference operator, denoted with an infix \. It denotes the difference between the character class specified as the left operand and the literal set given as the right operand. For example, the below literal set matches any ASCII character except (.

  [AsciiCharS \ "("]

The range operator, denoted with an infix .., matches any string covered by the range between the left literal set and the right literal set.

  ["10" .. "99"]
Note
There are restrictions on the kinds of arguments taken by these operators to ensure that the resulting literal set can be computed and matched efficiently.

The wildcard regular expression, denoted , matches any single byte. The matched value returned by the wildcard will be a single element byte list containing the matched byte. (The symbol is used instead of the more common ., since the . character is sometimes hard to see in an editor, and it conflicts with the . notation used in the module system for the grammar sublanguage.)

A regular expression non-terminal can be used as a regular expression element composable with other regular expressions using the regular expression combinators.

Scanning

Although scanning for specific tag bytes can be expressed as a regular expression, the regular expression can be complicated and hence inconvenient to write by hand. There is explicit support for such use cases using the forward scan (denoted with /sf) and backward scan (denoted with /sb) operators. A forward scan for the bytes tag would be written as /sf["tag"]; similarly, the corresponding backward scan would be /sb["tag"]. After a successful forward scan, the cursor would be positioned on the last tag byte (g in the case of /sf["tag"]) example), whereas after a successful backward scan, it is on the first tag byte (t in the case of /sb["tag"]). The bytes returned as the match correspond to all the bytes from the starting cursor position to the final position, inclusive. Matched bytes are always returned in forwards order, i.e. from the smallest matching cursor position to the largest.

For example, if the view contains the bytes 123abcxyz and the cursor is positioned on the a, /sf["xy"] will return abcxy as matched bytes, leaving the cursor positioned on y, while /sb["12"] from the same cursor position will return 123a as matched bytes, leaving the cursor on 1.

Assigning Synthesized Attributes in Rules

As mentioned above, the definition of a non-terminal is given by a sequence of rules composed by the ; ordered-choice operator, where each rule is composed of rule elements. The key characteristic of a rule is that it is a self-contained specification of one alternative way of parsing that non-terminal. The self-contained property means that the values of all of the non-terminal’s synthesized attributes have been computed and assigned by the end of the rule. Synthesized attributes are not implicitly initialized to default values; they need to be explicitly initialized if necessary.

Synthesized attributes are assigned values in action blocks, which can be composed with other rule elements using combinators. It is useful to keep in mind how the properties of some combinators interact with action blocks containing such attribute assignments. In particular, some combinators like the unbounded repeat * could succeed without executing the rule elements in its immediate scope (i.e. no other combinator under the * has scope over the elements). For example, the specification below has two errors:

  ByteVec bv {v: [byte]} :=
    (b=Byte { bv.v := b :: bv.v })*

The first error is that if the sequence under the * is executed the first time, b is prefixed to the uninitialized attribute bv.v. The second error is that it is possible that the production rule succeeds without matching any byte (e.g. when it is at the end of input). In that case, the v attribute would remain unassigned at the end of the rule. A specification that assigned a default value to v would fix both these errors. This assignment can be done either in the attribute declaration itself, or with an initial action block.

  ByteVec bv {v: [byte] := []} :=
    (b=Byte { bv.v := b :: bv.v })*

  ByteVec bv {v: [byte]} :=
    { bv.v := [] }
    (b=Byte { bv.v := b :: bv.v })*

The ? option combinator also has the property that it could succeed without matching any input, resulting in any attribute assigments in action blocks under its scope not being executed after a successful match. A similar consideration applies to the | ordered choice operator, if attribute assignments are done in some of the choices but not others.

Unlike the unbounded repeat *, the bounded repeat combinator ^ should execute the rule elements in its immediate scope if the bound is greater than 0. Parsley is able to take this into account when the bound is a constant non-zero expression. However, when the bound is computed from other entities in the context, Parsley takes a conservative view and assumes that the bound could be zero, i.e. that any action elements in its immediate scope may not execute.

In the case of the map-views combinator @#[views, RE], any action elements in RE do not execute if views is an empty list. Parsley currently cannot statically determine if the list is always non-empty, and so also takes a conservative view: it assumes that views could be empty, and that any action blocks in RE may not execute.

Bitvectors

Bits can be matched within production rules; however, all non-bit-oriented constructs need to be byte-aligned. An alignment specifier can be used to restore this alignment.

A bitvector of length n can be matched using the non-terminal BitVector<n>. Bits in bitvectors are indexed from 0, with 0 mapped to the least-significant bit. So, for example, the bits in a BitVector<8> that matches 0x0F are arranged from 7:0, with bits in 7:4 matching 0b0000 and those in 3:0 matching 0b1111.

A user-defined bitfield bf covering n bits can be matched using the specifier $bitfield(bf), which converts the n matched bits into a value of the bf bitfield type.

The $align<n>$ operator can be used to skip over bits to the next byte boundary lying at a multiple of n bits, where n must be a multiple of 8. For example, the sequence BitVector<1> $align<64>, when started on a byte boundary, matches 1 bit and then skips over 63 bits to end at a cursor position 64 bits from the starting point.

A sequence of bits can be matched using a sequence of BitVectors and bitfield specifiers. However, this sequence needs to start and end on a byte boundary. Parsley requires that within a rule, all regular expressions, non-terminals, and invocations of choice, star, option, and view-mapping combinators need to occur at byte-aligned positions in the parsing buffer.

If a failure is experienced while parsing a sequence of bits, the next parsing choice continues at the byte-aligned location at which the bit matching started.

An example of the use of the bitvector and bitfield support is shown below. The rule for non-terminal N matches 24 bits, with the first byte being ignored, the next 7 bits stored in attribute v, and the next 9 bits converted into a bitfield bf and stored in attribute f.

  bitfield bf = {
    top: 8:7,
    bot: 6:0
  }

  format {
    N n {v: bitvector<7>, f: bf} :=
      BitVector<1> $align<8> v=BitVector<7> f=$bitfield(bf)
      { n.v := v;
        n.f := f }
  }

If $bitfield(bf) matches 0b100000011 above, f.top will be 0b10 and f.bot will be 0b0000011.

Endianness

Endianness concerns arise when using bitvectors that are longer than 8 bits or that straddle byte boundaries. The bytes containing the bits being parsed are traversed in a left-to-right manner, i.e. in the direction of increasing byte offset into the parse buffer. This implies that the bytes in a long bitvector appear in a big-endian order. This needs to be accounted for when converting bitvectors into integers.

For example, if the non-terminal Bits below is used to parse the four bytes 0x11000102, the values of b.v1, b.v2 and b.v3 will be 0_u8, 17_u8 and 258_u32 respectively.

  format {
    Bits b {v1: u8, v2: u8, v3: u32} :=
      v1=BitVector<3> v2=BitVector<5> v3=BitVector<24> // matches 4-bytes
      { b.v1 := Bits.to_u8(v1);
        b.v2 := Bits.to_u8(v2);
        b.v3 := Bits.to_u32(v3) }
  }

The Bits module contains integer conversion routines that take an endianness specification, such as to_u16_endian, to_i32_endian, etc. The first or leftmost eight bits in a little-endian bitvector are considered to form the least-significant byte, whereas for a big-endian bitvector, the last or rightmost eight bits are considered to form the least-significant byte.

A little-endian conversion of the 24-bit v3 can be computed using Bits.to_u32_endian, resulting in b.v3 taking the value 131328_u32.

        b.v3 := Bits.to_u32_endian(endian::Little(), v3)

As the example above shows, the default integer conversion of a parsed bitvector treats the bytes within it in big-endian order.

Annotations

Parsley provides an annotation scheme that can be used to provide auxiliary information for non-terminals. Each non-terminal can be annotated via a decorator, using the syntax below:

#[deco_type1(deco_val1:key1=val1,key2,key3=val3);deco_type2(deco_val2: ...)]
NT ... := ...

The decorator is specified immediately before the non-terminal it applies to, and consists of a sequence of decorator specifiers separated by ;. Each specifier has a decorator type and its decorator value, with the value enclosed within a pair of parentheses ( ) The value can be optionally be parameterized by a list of keys with optional values that are separated by ,. If this parameter list is present, it is separated from the value by a :, and a parameter key with a value is denoted simply with key=value.

All the entities in the decorator are simple identifiers; i.e. expression syntax cannot be used as types, keys or values.

A decorator cannot have two specifiers with the same decorator type, and a decorator value cannot have two keys with the same name.

Parsley only interprets decorators that it supports; other decorators are silently ignored.

Note: The decorator "type" identifier is unrelated any identifier in the Parsley type system.

Whitespace decorations

Unlike parsers for programming languages that are designed to be readable by humans, Parsley is normally used for specify data formats that are often optimized for machine processing rather than human readability. As a result, it does not have any implicit notion of whitespace. However, one often needs to specify data formats that can have fragments that use a notion of whitespace, and the grammar elements in these formats can be tedious to specify when every occurence of whitespace needs to be explicitly specified.

To help with this use case, Parsley supports annotating a non-terminal (say NT) with a whitespace decorator that names another non-terminal (say WS) that can be considered to separate the rule elements in each rule of NT. The goal here is to recover the convenience of an implicit notion of whitespace when specifying the rules of the non-terminal NT, and making explicit this notion of whitespace just once in the decorator.

As an example, suppose the following definition was used to define the WS non-terminal to match possibly empty whitespace:

  WS (allow_empty: bool) :=
    [allow_empty]
    (# [" " | "\t" | "\r" | "\n"]* #)
  ; [!allow_empty]
    (# [" " | "\t" | "\r" | "\n"]+ #)
  ;;

The inherited attribute allow_empty controls whether WS can match empty whitespace.

Then it can be used to specify the whitespace separating the rule elements for a non-terminal NT as follows:

  #[whitespace(WS:allow_empty=true)]
  NT := A B
      ; ((A C) | (B D))*
      ;;

This is equivalent to the following definition of NT without a whitespace decorator, which is quite verbose.

  NT := // first rule after decoration
        WS<allow_empty=bool::True()> A WS<allow_empty=bool::True()> B WS<allow_empty=bool::True()>

        // second rule after decoration
      ; WS<allow_empty=bool::True()>
        (  (A WS<allow_empty=bool::True()> C WS<allow_empty=bool::True>())
        | (B WS<allow_empty=bool::True()> D WS<allow_empty=bool::True>())
        )*
        WS<allow_empty = bool::True>
      ;;

Parsley uses some heuristics to insert the specified whitespace non-terminal into the decorated rules, which in most cases should give the same result as a parser with an implicit notion of whitespace.

Caution

There are some situations where the decorator should only be used with considerable caution, if at all:

  • the whitespace non-terminal cannot match empty whitespace,

  • the non-terminal that is a candidate for decoration has rules that use constraints that involve cursor locations, or bit-level constructions that are typically quite dependent on appropriate byte-alignment, or view constructions that rely crucially on cursor offset locations.

In these situations, it is likely that the automatically inserted whitespace might violate assumptions about byte-alignment or cursor locations in views and constraints, or about matching optional whitespace.

Parsley uses some heuristics when inserting whitespace in such rules. For example, it does not introduce whitespace immediately before constraints or any view mapping combinator, and before or after any construct dealing with bitvectors.

Note
One can use the -dd option to print the rules of a non-terminal after whitespace decoration to ensure they align with expectations.

Modules

Parsley supports the use of definitions from one Parsley specification, considered as a module, within another Parsley specification; this is described further in the expression language documentation.

As described there, the non-terminal used as a rule element in a production rule for a non-terminal definition can refer to a non-terminal defined in another module: M.NT refers to the non-terminal NT defined in module M.

Unlike non-terminals, Parsley does not support cross-module use of definitions for regular expressions and literals. A specification can however refer to regular expressions and literals defined in the Parsley standard library.