Skip to content

Latest commit

 

History

History
331 lines (254 loc) · 16.9 KB

udp.adoc

File metadata and controls

331 lines (254 loc) · 16.9 KB

Lesson 1: The User Datagram Protocol

For our first exercise, we examine the User Datagram Protocol (UDP). The UDP specification is provided in IETF RFC 768. UDP is a transport layer protocol used alongside IP. In contrast to TCP, another transport layer protocol, UDP does not provide some useful guarantees, such as reliability and in-order messaging. However, it is lightweight and simple, making it a great introductory example use case for this Parsley tutorial.

Declarations

A format specification in Parsley is constructed via a sequence of declarations. Parsley currently supports the following classes of declarations:

  • include declarations that enable the inclusion of declarations from other files within the same Parsley specification.

  • import declarations that enable the import of declarations from files from other Parsley specifications.

  • const definitions to support .,constant expressions.

  • type definitions that allow users to specify their own types.

  • fun and recfun declarations that correspond to functions and recursive functions respectively.

  • format block declarations that specify non-terminal definitions.

In this first lesson, we will use the format and type declarations.

A (Very) Brief Introduction to Parsley Types

The specification states that the UDP header comprises four fields corresponding to the source port, destination port, length, and checksum. Each is of two bytes. In our specification, the source port, destination port, and length fields are treated as 16-bit unsigned integers, and the checksum is treated as a list of bytes. That said, there exist valid alternatives to this approach, e.g., the source port and destination port could also be byte lists. Following the UDP header is the payload or data. Given this structure, we can declare a type to capture a UDP packet:

type udp_packet = {
  source_port: u16,
  destination_port: u16,
  length: u16,
  checksum: [byte],
  data: [byte]
}

We provide a quick primer on types here but a more comprehensive treatment can be found in the documentation.

Note
As an aside, the observant reader may have noticed that the udp_packet type declaration does not enforce the requirement that the checksum field be exactly two bytes long. However, the udp_packet is merely a user-defined type. The enforcement of the length requirement will be done when we specify the grammar in the the format block. Don’t fret! We will get to this soon.

Atomic Types

Parsley supports a variety of atomic or built-in types:

  • byte: captures a byte

  • u8 , u16, u32, u65; i8, i16, i32, i64; usize; isize: are used to capture various kinds of integers.

    • u8, u16, u32, and u64 correspond to unsigned integers of bit lengths of 8, 16, 32, and 64 respectively.

    • i8, i16, i32, and i64 correspond to signed integers of bit lengths of 8, 16, 32, and 64 respectively.

    • usize corresponds to an unsized integer of unspecified length.

    • isize corresponds to a signed integer of unspecified length.

  • double: captures a floating-point

  • string: captures a sequence of characters

  • bool: captures a Boolean or truth value

  • endian: captures endian values that express what byte ordering to use when interpreting integer values

  • unit: a type to express that a function returns nothing (similar to void in C)

User-Defined Types

Parsley also supports user-defined types. We can construct these user-defined types via type declarations of the form:

type type_name = type_form

where type_name and type_form correspond to the identifier for the type and the structure or form of the type.

Some examples of user-defined type declarations are:

  • type byte_list = [byte] is a list of bytes.

  • type int_bool_tuple = (u16, bool) is a tuple with element types u16 and bool respectively

  • type list_of_tuples = [(u32, u32)] is a list of two-tuples with element types u32 and u32.

A special kind of user-defined type is the record type. The record type comprises a number of component type fields and has the general form of:

type type_name = {
  field_name_0: field_type_0,
  field_name_1: field_type_1,
  ...
  field_name_n: field_type_n
}

Each line within the curly brackets specifies a component field via a field name and type. Each field declaration, aside from the last one, is terminated by a , delimiter.

Note
There is much more to discuss about types, notably variant types and type variables. The Parsley documentation on types discusses these topics further.

Declaring the UDP Packet Type

Let’s return to our UDP packet.

type udp_packet = {
  source_port: u16,
  destination_port: u16,
  length: u16,
  checksum: [byte],
  data: [byte]
}

The udp_packet is a record type that comprises five component subtypes or fields.

The first four fields correspond to the UDP header: source_port is a 16-bit unsigned integer, destination_port is a 16-bit unsigned integer, length is a 16-bit unsigned integer, and checksum is a list of bytes. The last component field, `data, captures the the payload, and it is also a list of bytes. Now that we have captured the general structure of a UDP packet, we can move on to capturing the specification in Parsley.

The Format Block

The format block is essential to every Parsley specification. This is where we define non-terminals and, as the name suggests, specify the format of the specification. The format block has the general form:

format {
  NT1 n1 ... := ...;;
  NT2 n2 ... := ...;;
  ...
}

The format block is simply a sequence of non-terminal definitions. Each definition, except for the last, must end with ;;, which serves as a non-terminal definition delimiter.

Non-Terminals, Synthesized Attributes, and Inherited Attributes

Parsley provides a rich subgrammar that has some similarity with BNF but also a number of key differences. Here, we will only cover the basics; more topics will be introduced in subsequent lessons. A non-terminal is derived using one or more production rules that are separated by ;, the ordered choice operator. Each production rule is a sequence of rule elements. A sequence of rule elements is matched if the input matches each of the rule elements in order.

In this first lesson, we cover the standard approach to defining non-terminals. A non-terminal in Parsley has a type associated with it, which is always the record type using the standard approach. This type is specified by synthesized attributes, which can be thought of as attributes that are derived as we perform the matching. Additionally, a type declaration may include inherited attributes, which are attributes that are provided when the non-terminal is used in a rule element or invoked by a call from external application code.

Note
Parsley also supports an alternative approach for defining byte-vector—​valued non-terminals, one which uses regular expressions. We shall cover this in a future lesson!

Working Through the UDP Packet Format

Now, let’s work on developing a Parsley specification for a UDP packet:

format {
  UDPPacket udpp {u: udp_packet} := ...
}

We still have to fill in the details obviously, but this is a good start. We have a non-terminal entitled UDPPacket that has the short name of udpp. And it has the synthesized attribute u of type udp_packet, which we had defined earlier; synthesized attributes like these are specified within curly brackets.

Note
While the original specification does not explicitly require using the big-endian byte order, the Assigned Numbers RFC specifies that Internet Protocols should be expressed in network byte order, i.e., big-endian byte order. Hence, we follow this approach. However, If we had instead wished to pass in an endian value as an inherited attribute to UDPPacket, we would have written: UDPPacket udpp (e: endian) {u: udp_packet} := …​.

Now, let’s work on the rule elements. The UDP specification states the header comprises four fields, each of two bytes, corresponding to the source port, the destination port, the length, and the checksum. Thus, we have four identifiers (source_port, destination_port, length, checksum) to capture these header fields. And, of course, we have data whose length is derived from the value matched by the length identifier.

format {
  UDPPacket udpp {u: udp_packet} :=
    source_port = UInt16<endian=endian::Big()>
    destination_port = UInt16<endian=endian::Big()>
    length = UInt16<endian=endian::Big()>
    checksum = (Byte^2u)
    [length >=_u16 8u16]
    data = (Byte^(Usize.of_u16(length) -_u 8u))
}

UInt16 (endian: endian) and Byte are built-in non-terminals that match a 16-bit unsigned integer and a byte respectively. The identifiers source_port, destination_port, and length are assigned 16-bit unsigned integer values of the endianness endian::Big(). The identifier checksum is matched to a list of two bytes by applying the bounded repeat operator (^) to the Byte built-in non-terminal using the operand 2u (a literal of type usize with value 2). A 16-bit unsigned integer of the endianness endian::Big() is assigned to the length identifier.

Note
The Parsley Library Standard Documentation provides an in-depth discussion of built-in variant types, built-in non-terminals, and module operations.
Note
Parsley supports the construction of various kinds of literals corresponding to different types, e.g., 4u8 is an 8-bit unsigned integer corresponding to the value 4, 7i64 is a 64-bit signed integer corresponding to the value 7, endian::Small() is an endian value of small, "hello" is the string reading hello. As we come across literals in this tutorial, we will briefly remark on them; a more thorough treatment can be found in the documentation on expressions.

Our next task is to read in the data, but this requires first ensuring the length value is reasonable. We do this by imposing a constraint.

Constraints

A constraint is simply a parsing check. More precisely, a constraint captures a Boolean-valued expression that determines whether or not to continue with the current production rule. If the expression evaluates to a false Boolean value, then the constraint is not met and the production rule does not apply. If, however, the constraint evaluates to a true Boolean value, then the constraint is met and we move on to the next rule element in the production rule.

In our UDP example, the constraint [length >=_u16 8u16] does a check to ensure the length variable is assigned a value of at least 8. More precisely, it uses the built-in greater-than-or-equal-to operator for 16-bit unsigned integers to ensure that the identifier length matches a u16 value that is at least 8u16. If it were assigned a value less than 8u16, then we would immediately determine that the packet is malformed since the header itself is 8 bytes.

Reading the Payload

Next, we must read in the bytes corresponding to data. The exact amount of bytes to read can be obtained by some simple arithmetic using the previously matched length identifier. Indeed, the pattern of reading in some length value and then later reading in a payload based on that value is extremely common when dealing with specifications.

In our case, the length identifier we had matched earlier corresponds to the full UDP packet length; ergo, the payload length is obtained by subtracting 8 from it. We apply the bounded repeat operator (^) to Byte, in similar fashion as we had done when reading in the checksum. However, as the operand to the bounded repeat operator (^) must be of type usize, we must first use the standard library function Usize.of_u16() to create a value that is of the same length as length but of type usize. We then subtract 8u (value 8 of type usize) from the resultant value by using the subtraction operator for operands of type usize (-_u). Thus, we get: data = (Byte^(Usize.of_u16(length) -_u 8u)).

Action Blocks

All that is left is to assign values to the synthesized variables. We do this via action blocks:

format {
  UDPPacket udpp {u: udp_packet} :=
    ...
    {
      udpp.u.source_port := source_port;
      udpp.u.destination_port := destination_port;
      udpp.u.length := length;
      udpp.u.checksum := checksum;
      udpp.u.data := data
    }
}

The assignment of values to the synthesized attributes is performed by the action block, which is delimited by { and }. The := symbol in action blocks denote assignment. Each assignment statement in an action block aside from the last must end with the ; delimiter. There are other uses for action blocks as well, and we will cover them in future lessons.

In our UDP example, udpp is the short name for the non-terminal we are working on. udpp.u refers to the udp_packet record attached to the udpp non-terminal. In a straightforward fashion, we are simply assigning the fields in the udpp.u record the values of the variables that we have obtained by reading the input earlier.

The Final Parsley File

Thus, we have our final Parsley file (udp.ply):

type udp_packet = {
  source_port: u16,
  destination_port: u16,
  length: u16,
  checksum: [byte],
  data: [byte]
}

format {
  UDPPacket udpp {u: udp_packet} :=
    source_port = UInt16<endian=endian::Big()>
    destination_port = UInt16<endian=endian::Big()>
    length = UInt16<endian=endian::Big()>
    checksum = (Byte^2u)
    [length >=_u16 8u16] // length field must be at least 8, the UDP header length
    data = (Byte^(Usize.of_u16(length) -_u 8u)) // subtract 8 for the UDP header length
    //[validate_checksum(source_port, destination_port, length, checksum, data)]
    {
      udpp.u.source_port := source_port;
      udpp.u.destination_port := destination_port;
      udpp.u.length := length;
      udpp.u.checksum := checksum;
      udpp.u.data := data
    }
}

Comments

You may noticed that we have incorporated some comments in our final Parsley file. Comments in Parsley begin with two forward slash characters (//). Everything between the // delimiter and the first new line character to follow it is part of the comment.

Do We Need a udp_packet Type?

We had earlier declared a udp_packet type, which is used as a synthesized attribute in the UDPPacket non-terminal declaration. Of course, we did not have to define such a type. Indeed, we could have instead supplied synthesized attributes corresponding to the type’s member fields. We have provided such an implementation in udp_alt.ply, and we have pasted the Parsley file’s contents below.

format {
  UDPPacket udpp {source_port: u16, destination_port: u16, length: usize,
                  checksum: [byte], data: [byte]} :=
    source_port = UInt16<endian=endian::Big()>
    destination_port = UInt16<endian=endian::Big()>
    length = UInt16<endian=endian::Big()>
    checksum = (Byte^2u)
    [length >=_u16 8u16] // length field must be at least 8, the UDP header length
    data = (Byte^(Usize.of_u16(length) -_u 8u)) // subtract 8 for the UDP header length
    {
      udpp.source_port := source_port;
      udpp.destination_port := destination_port;
      udpp.length := Usize.of_u16(length);
      udpp.checksum := checksum;
      udpp.data := data
    }
}

That said, there are numerous benefits to declaring and using a record type, e.g., modularization, code brevity, readability, and consequently improved security. These benefits are especially pronounced in larger projects that make heavy use of the user-defined type. In the next lesson, we will demonstrate how to use the udp_packet type when implementing a subset of IPv4.

Exercise: The Simple Message Specification

As a simple exercise, consulting the Parsley documentation on expressions, try to write a Parsley specification for the following mock message format:

A simple message contains four parts, ordered as follows: (1) the 5-byte string "hello", (2) a length value that specifies, the byte-length of a short message as an 8-bit unsigned integer, (3) a message that is a list of bytes of length length, (4) the 7-byte string "goodbye".

For the Parsley specification, store length and message in a record type as a synthesized attribute. An example Parsley specification for simple message is provided in simple_message.ply.

Note
There is an even easier way to implement this specification in Parsley using regular expressions, which we will cover in a couple of lessons.