For our first exercise, we examine the User Datagram Protocol (UDP). The UDP specification is provided in IETF RFC 768. UDP is a transport layer protocol used alongside IP. In contrast to TCP, another transport layer protocol, UDP does not provide some useful guarantees, such as reliability and in-order messaging. However, it is lightweight and simple, making it a great introductory example use case for this Parsley tutorial.
A format specification in Parsley is constructed via a sequence of declarations. Parsley currently supports the following classes of declarations:
-
include
declarations that enable the inclusion of declarations from other files within the same Parsley specification. -
import
declarations that enable the import of declarations from files from other Parsley specifications. -
const
definitions to support .,constant expressions. -
type
definitions that allow users to specify their own types. -
fun
andrecfun
declarations that correspond to functions and recursive functions respectively. -
format
block declarations that specify non-terminal definitions.
In this first lesson, we will use the format
and type
declarations.
The specification states that the UDP header comprises four fields corresponding to the source port, destination port, length, and checksum.
Each is of two bytes.
In our specification, the source port, destination port, and length fields are treated as 16-bit unsigned integers, and the checksum is treated as a list of bytes.
That said, there exist valid alternatives to this approach, e.g., the source port and destination port could also be byte lists.
Following the UDP header is the payload or data.
Given this structure, we can declare a type
to capture a UDP packet:
type udp_packet = { source_port: u16, destination_port: u16, length: u16, checksum: [byte], data: [byte] }
We provide a quick primer on types here but a more comprehensive treatment can be found in the documentation.
Note
|
As an aside, the observant reader may have noticed that the udp_packet type declaration does not enforce the requirement that the checksum field be exactly two bytes long.
However, the udp_packet is merely a user-defined type.
The enforcement of the length requirement will be done when we specify the grammar in the the format block.
Don’t fret!
We will get to this soon.
|
Parsley supports a variety of atomic or built-in types:
-
byte
: captures a byte -
u8
,u16
,u32
,u65
;i8
,i16
,i32
,i64
;usize
;isize
: are used to capture various kinds of integers.-
u8
,u16
,u32
, andu64
correspond to unsigned integers of bit lengths of 8, 16, 32, and 64 respectively. -
i8
,i16
,i32
, andi64
correspond to signed integers of bit lengths of 8, 16, 32, and 64 respectively. -
usize
corresponds to an unsized integer of unspecified length. -
isize
corresponds to a signed integer of unspecified length.
-
-
double
: captures a floating-point -
string
: captures a sequence of characters -
bool
: captures a Boolean or truth value -
endian
: captures endian values that express what byte ordering to use when interpreting integer values -
unit
: a type to express that a function returns nothing (similar to void in C)
Parsley also supports user-defined types. We can construct these user-defined types via type declarations of the form:
type type_name = type_form
where type_name
and type_form
correspond to the identifier for the type and the structure or form of the type.
Some examples of user-defined type declarations are:
-
type byte_list = [byte]
is a list of bytes. -
type int_bool_tuple = (u16, bool)
is a tuple with element typesu16
andbool
respectively -
type list_of_tuples = [(u32, u32)]
is a list of two-tuples with element typesu32
andu32
.
A special kind of user-defined type is the record type. The record type comprises a number of component type fields and has the general form of:
type type_name = { field_name_0: field_type_0, field_name_1: field_type_1, ... field_name_n: field_type_n }
Each line within the curly brackets specifies a component field via a field name and type.
Each field declaration, aside from the last one, is terminated by a ,
delimiter.
Note
|
There is much more to discuss about types, notably variant types and type variables. The Parsley documentation on types discusses these topics further. |
Let’s return to our UDP packet.
type udp_packet = { source_port: u16, destination_port: u16, length: u16, checksum: [byte], data: [byte] }
The udp_packet
is a record type that comprises five component subtypes or fields.
The first four fields correspond to the UDP header: source_port
is a 16-bit unsigned integer, destination_port
is a 16-bit unsigned integer, length
is a 16-bit unsigned integer, and checksum
is a list of bytes.
The last component field, `data, captures the the payload, and it is also a list of bytes.
Now that we have captured the general structure of a UDP packet, we can move on to capturing the specification in Parsley.
The format
block is essential to every Parsley specification.
This is where we define non-terminals and, as the name suggests, specify the format of the specification.
The format
block has the general form:
format { NT1 n1 ... := ...;; NT2 n2 ... := ...;; ... }
The format
block is simply a sequence of non-terminal definitions.
Each definition, except for the last, must end with ;;
, which serves as a non-terminal definition delimiter.
Parsley provides a rich subgrammar that has some similarity with BNF but also a number of key differences.
Here, we will only cover the basics; more topics will be introduced in subsequent lessons.
A non-terminal is derived using one or more production rules that are separated by ;
, the ordered choice operator.
Each production rule is a sequence of rule elements.
A sequence of rule elements is matched if the input matches each of the rule elements in order.
In this first lesson, we cover the standard approach to defining non-terminals. A non-terminal in Parsley has a type associated with it, which is always the record type using the standard approach. This type is specified by synthesized attributes, which can be thought of as attributes that are derived as we perform the matching. Additionally, a type declaration may include inherited attributes, which are attributes that are provided when the non-terminal is used in a rule element or invoked by a call from external application code.
Note
|
Parsley also supports an alternative approach for defining byte-vector—valued non-terminals, one which uses regular expressions. We shall cover this in a future lesson! |
Now, let’s work on developing a Parsley specification for a UDP packet:
format { UDPPacket udpp {u: udp_packet} := ... }
We still have to fill in the details obviously, but this is a good start.
We have a non-terminal entitled UDPPacket
that has the short name of udpp
.
And it has the synthesized attribute u
of type udp_packet
, which we had defined earlier; synthesized attributes like these are specified within curly brackets.
Note
|
While the original specification does not explicitly require using the big-endian byte order,
the Assigned Numbers RFC specifies that Internet Protocols should be expressed in network byte order, i.e., big-endian byte order.
Hence, we follow this approach. However, If we had instead wished to pass in an endian value as an inherited attribute to UDPPacket, we would have written: UDPPacket udpp (e: endian) {u: udp_packet} := … .
|
Now, let’s work on the rule elements.
The UDP specification states the header comprises four fields, each of two bytes, corresponding to the source port, the destination port, the length, and the checksum.
Thus, we have four identifiers (source_port
, destination_port
, length
, checksum
) to capture these header fields.
And, of course, we have data
whose length is derived from the value matched by the length
identifier.
format { UDPPacket udpp {u: udp_packet} := source_port = UInt16<endian=endian::Big()> destination_port = UInt16<endian=endian::Big()> length = UInt16<endian=endian::Big()> checksum = (Byte^2u) [length >=_u16 8u16] data = (Byte^(Usize.of_u16(length) -_u 8u)) }
UInt16 (endian: endian)
and Byte
are built-in non-terminals that match a 16-bit unsigned integer and a byte respectively.
The identifiers source_port
, destination_port
, and length
are assigned 16-bit unsigned integer values of the endianness endian::Big()
.
The identifier checksum
is matched to a list of two bytes by applying the bounded repeat operator (^
) to the Byte
built-in non-terminal using the operand 2u
(a literal of type usize
with value 2).
A 16-bit unsigned integer of the endianness endian::Big()
is assigned to the length identifier.
Note
|
The Parsley Library Standard Documentation provides an in-depth discussion of built-in variant types, built-in non-terminals, and module operations. |
Note
|
Parsley supports the construction of various kinds of literals corresponding to different types, e.g., 4u8 is an 8-bit unsigned integer corresponding to the value 4, 7i64 is a 64-bit signed integer corresponding to the value 7, endian::Small() is an endian value of small, "hello" is the string reading hello. As we come across literals in this tutorial, we will briefly remark on them; a more thorough treatment can be found in the documentation on expressions.
|
Our next task is to read in the data, but this requires first ensuring the length value is reasonable. We do this by imposing a constraint.
A constraint is simply a parsing check. More precisely, a constraint captures a Boolean-valued expression that determines whether or not to continue with the current production rule. If the expression evaluates to a false Boolean value, then the constraint is not met and the production rule does not apply. If, however, the constraint evaluates to a true Boolean value, then the constraint is met and we move on to the next rule element in the production rule.
In our UDP example, the constraint [length >=_u16 8u16]
does a check to ensure the length
variable is assigned a value of at least 8.
More precisely, it uses the built-in greater-than-or-equal-to operator for 16-bit unsigned integers to ensure that the identifier length
matches a u16
value that is at least 8u16
.
If it were assigned a value less than 8u16
, then we would immediately determine that the packet is malformed since the header itself is 8 bytes.
Next, we must read in the bytes corresponding to data
.
The exact amount of bytes to read can be obtained by some simple arithmetic using the previously matched length
identifier.
Indeed, the pattern of reading in some length value and then later reading in a payload based on that value is extremely common when dealing with specifications.
In our case, the length
identifier we had matched earlier corresponds to the full UDP packet length; ergo, the payload length is obtained by subtracting 8 from it.
We apply the bounded repeat operator (^
) to Byte
, in similar fashion as we had done when reading in the checksum.
However, as the operand to the bounded repeat operator (^
) must be of type usize
, we must first use the standard library function Usize.of_u16()
to create a value that is of the same length as length
but of type usize
.
We then subtract 8u
(value 8 of type usize
) from the resultant value by using the subtraction operator for operands of type usize
(-_u
).
Thus, we get: data = (Byte^(Usize.of_u16(length) -_u 8u))
.
All that is left is to assign values to the synthesized variables. We do this via action blocks:
format { UDPPacket udpp {u: udp_packet} := ... { udpp.u.source_port := source_port; udpp.u.destination_port := destination_port; udpp.u.length := length; udpp.u.checksum := checksum; udpp.u.data := data } }
The assignment of values to the synthesized attributes is performed by the action block, which is delimited by {
and }
.
The :=
symbol in action blocks denote assignment.
Each assignment statement in an action block aside from the last must end with the ;
delimiter.
There are other uses for action blocks as well, and we will cover them in future lessons.
In our UDP example, udpp
is the short name for the non-terminal we are working on.
udpp.u
refers to the udp_packet
record attached to the udpp
non-terminal.
In a straightforward fashion, we are simply assigning the fields in the udpp.u
record the values of the variables that we have obtained by reading the input earlier.
Thus, we have our final Parsley file (udp.ply
):
type udp_packet = { source_port: u16, destination_port: u16, length: u16, checksum: [byte], data: [byte] } format { UDPPacket udpp {u: udp_packet} := source_port = UInt16<endian=endian::Big()> destination_port = UInt16<endian=endian::Big()> length = UInt16<endian=endian::Big()> checksum = (Byte^2u) [length >=_u16 8u16] // length field must be at least 8, the UDP header length data = (Byte^(Usize.of_u16(length) -_u 8u)) // subtract 8 for the UDP header length //[validate_checksum(source_port, destination_port, length, checksum, data)] { udpp.u.source_port := source_port; udpp.u.destination_port := destination_port; udpp.u.length := length; udpp.u.checksum := checksum; udpp.u.data := data } }
You may noticed that we have incorporated some comments in our final Parsley file.
Comments in Parsley begin with two forward slash characters (//
).
Everything between the //
delimiter and the first new line character to follow it is part of the comment.
We had earlier declared a udp_packet
type, which is used as a synthesized attribute in the UDPPacket non-terminal declaration.
Of course, we did not have to define such a type.
Indeed, we could have instead supplied synthesized attributes corresponding to the type’s member fields.
We have provided such an implementation in udp_alt.ply
, and we have pasted the Parsley file’s contents below.
format { UDPPacket udpp {source_port: u16, destination_port: u16, length: usize, checksum: [byte], data: [byte]} := source_port = UInt16<endian=endian::Big()> destination_port = UInt16<endian=endian::Big()> length = UInt16<endian=endian::Big()> checksum = (Byte^2u) [length >=_u16 8u16] // length field must be at least 8, the UDP header length data = (Byte^(Usize.of_u16(length) -_u 8u)) // subtract 8 for the UDP header length { udpp.source_port := source_port; udpp.destination_port := destination_port; udpp.length := Usize.of_u16(length); udpp.checksum := checksum; udpp.data := data } }
That said, there are numerous benefits to declaring and using a record type, e.g., modularization, code brevity, readability, and consequently improved security.
These benefits are especially pronounced in larger projects that make heavy use of the user-defined type.
In the next lesson, we will demonstrate how to use the udp_packet
type when implementing a subset of IPv4.
As a simple exercise, consulting the Parsley documentation on expressions, try to write a Parsley specification for the following mock message format:
A simple message contains four parts, ordered as follows: (1) the 5-byte string "hello", (2) a length
value that specifies, the byte-length of a short message as an 8-bit unsigned integer, (3) a message
that is a list of bytes of length length
, (4) the 7-byte string "goodbye".
For the Parsley specification, store length
and message
in a record type as a synthesized attribute. An example Parsley specification for simple message is provided in simple_message.ply.
Note
|
There is an even easier way to implement this specification in Parsley using regular expressions, which we will cover in a couple of lessons. |
Navigation: ↑ Tutorial Overview | → Next Lesson | 📄 Documentation