Skip to content

Latest commit

 

History

History
210 lines (166 loc) · 9.39 KB

intro.adoc

File metadata and controls

210 lines (166 loc) · 9.39 KB

Introduction to Parsley

The Parsley system is a specification language and a set of tools to formally specify data formats and the parsing software that processes data in those formats. We focus below on describing the Parsley specification language; the tools in the rest of the system, including the parser generator and the prover framework, will be described as those tools become more usable.

The Parsley specification language describes data formats. It does this using two sub-languages: an expression sub-language that looks like a functional language, and a grammar sub-language that looks like an extended Backus-Naur Form notation that incorporates elements of an attribute grammar and the combinators from parsing expression grammars (PEGs).

Parsley seeks to provide the following features that are not often found in conventional grammar description systems that are based on context-free grammars:

  • Clean support for context sensitivity, especially context that may be distant from the current parsing location. This is provided using the attribute grammar system, in which inherited attributes are used to communicate this context.

  • A constraint system that can compute over this context to control the parsing rules that are applied at a parsing location.

  • A typed functional language that expresses the computations involved in computing attribute values and the constraints.

  • A module system to compose independent specifications, which allows specifying container data formats that contain nested objects or payloads conforming to different formats.

A high-level introduction to the sub-languages and a Parsley specification file follow.

The expression sub-language

Parsley uses a fairly typical typed first-order functional language as the expression sub-language. This language allows a user to specify types and functions. These can be then used in the grammar sub-language to specify the types of the grammar non-terminals, the computations in constraint expressions, and the computation of values to be assigned to synthesized attributes.

The basic atomic types in this language are byte, integer types of typical widths and signedness, double, string, bool, endian, bit and unit (known as void in C-like languages). In addition, the language includes parsing buffers as atomic values, which belong to a special atomic type called view. The basic type constructors are lists, tuples, and option types. Bitvectors of concrete lengths are also supported as type constructors. User-specified types include records and variants, and can be mutually recursive. Bitfield records can also be user-defined.

The language is equipped with a module system that provides access to a standard library. The library provides data structures such as lists, sets and maps, as well as operations on and conversions between the atomic types.

Note

The functional language is currently first-order: it does not support first-class functions or closures. All functions need to be completely applied, and cannot be curried.

The grammar sub-language

The grammar sub-language is used to define the non-terminals in the grammar. A non-terminal is defined in terms of a sequence of production rules. These rules form an ordered choice: later rules in the sequence are only tried if the earlier rules fail to match the input.

Each standard non-terminal can be specified to have a set of synthesized attributes, each with a declared type. The type of such a non-terminal thus forms a record, with each synthesized attribute corresponding to a field in the record.

A production rule is itself a sequence of rule elements. Primitive rule elements are literals, references to other non-terminals, constraints, and action blocks. Action blocks are typically used to compute the synthesized attributes of the non-terminal being defined. These primitive elements can be combined using typical grammar combinators such as Kleene star and ordered choice. Each rule element can be bound to a name, which can be used later in the rule to refer to the value matched by that element. In addition, rule elements can be restricted to certain offsets within the parsing buffer, a different parsing buffer (represented by a value of type view), or applied to each view in a list of views.

Bitvectors can also be used as rule-elements. However, there are alignment restrictions on their use, and Parsley requires non-bit-oriented constructs to be byte-aligned. Byte alignment can be ensured with the use of suitable alignment and padding specifiers.

The Parsley type-checker computes a type for each element of the production rule to ensure that the computations performed in the rule are well-typed.

For convenience, the sub-language provides special syntactic support for simple regular expressions, where the attribute grammar machinery can be unnecessarily heavy. In such cases, no synthesized attributes are specified for regular-expression non-terminals, and the type of these non-terminals is a list of bytes.

The standard library provides some useful primitive non-terminals that can be used to match basic types. Individual bytes can be matched against the non-terminals AsciiChar, HexChar, AlphaNum, Digit and Byte, each of which have the byte type. The S-suffixed versions of these non-terminals (such as AsciiCharS, DigitS, …​ ) represent the match as a single-element list containing the matched byte. This allows them to be used in with combinators that combine regular expressions or list-valued non-terminals.

Other primitive non-terminals support parsing binary integers of various widths, signedness and endianness.

Structure of a Parsley specification file

Parsley source files are named with a .ply suffix.

The specification of a format can be split across multiple files using the include specifier described below. A specification has a top-level source file, which is typically the file specified on the command line to the Parsley compiler. The name for the Parsley specification is derived from the name of this top-level file: the top-level filename is stripped of its .ply suffix and then capitalized (i.e. the first character is made uppercase if it is not already uppercase, with the remaining characters unchanged).

Any one of these files will have the same structure.

A file contains a sequence of declarations, each of which can be one of the following:

  • include specifiers for inclusions of declarations from other Parsley files in the same specification

  • import specifiers for imports of declarations from other Parsley specifications

  • type definitions

  • const definitions of constant expressions

  • fun (function) or recfun (recursive function) definitions

  • format blocks containing a sequence of non-terminal definitions

The grammar sub-language is used only within the format blocks.

include specifiers

File inclusion allows a specification to be split across multiple files, with closely related declarations appearing in the same file. There will be typically a top-level Parsley file for the entire specification, and this is the only file that needs to be passed to the Parsley compiler.

Note
All source files for a single Parsley specification need to be in the same directory.

Conceptually, the contents of the included file are placed at the point of the first include declaration for that file. Subsequent include declarations of the same file do not have any effect. The specification is only processed once all include declarations have been substituted away, and all subsequent processing is performed on the entire specification.

Scoping

The scoping of type and fun/recfun is lexical: a type definition can only use the types defined by definitions in scope at that point. A const definition can only refer to types and other constants and functions whose definitions are in scope. Similarly, a fun function definition can only refer to types or call other functions whose definitions are in scope. recfun function definitions support mutually recursive function definitions that can call themselves and each other, as well as other functions that are in scope.

type, const and fun/recfun definitions are processed before any format block, so all type, constant and function definitions are in scope in every format block.

Comments

Comments are marked with a leading // and continue until the end of the line.

import specifiers and specification modules

Parsley supports the modular composition of specifications. Each specification is treated as a module, with the module name given by the specification name as described above (i.e. the capitalized basename of the top-level source file of the specification).

The modules given by an import specifier are searched for in the import directories specified on the command line. Once a top-level source file has been found for a module in an import directory, that directory is expected to contain all the source files for that module, as described above.

Parsley modules are capitalized identifiers (i.e. the first character of the identifier is uppercase), whereas their source files need not be capitalized. However, it is an error if uncapitalized and capitalized versions of a top-level source file are both present.

More information about the module system is present in the expression language documentation.