
Design of the new compiler parser


Since it's a little complex, I decided to write up a wiki page for the design of the implementation of the compiler and parser once the current set of refactorings is complete. I would like @radumereuta, @cos, and @osa1 in particular to review this design and let me know if it is satisfactory to them. I also include a section at the end describing two parallel paths of work on the implementation which begin the migration path. I would like @osa1 and @radumereuta to work on the first of these paths, which begins the work needed to move forward with the new frontend, and @cos to work on the second, which begins the work needed to build a compiler over the new data structures.

Representations and Components

We can imagine the pipeline of the K framework as a set of transformations between four distinct representations of definitions and terms, and the components which link these representations together.

  1. Text: This is the textual representation of the definition or term in concrete syntax, which is what the parser consumes.
  2. Concrete Syntax: This is the parse tree of the textual representation, which has a 1-1 correspondence with the original text that was parsed.
  3. KORE: This is the AST of what was parsed, a conceptual representation of the definition with the details of its specific textual representation flattened away.
  4. Compiled KORE: This is the KORE of the definition after it has been run through all the compilation stages; it should contain all the information needed to execute the definition. (A sketch of these representations as types follows the list.)
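
To make the pipeline concrete, here is a minimal sketch of the four representations and the stages between them as Java types; all of the names below are illustrative and do not come from the actual implementation.

    // Illustrative marker types for the four representations; real class names will differ.
    interface Representation {}

    final class Text implements Representation {            // 1. raw definition or term text
        final String source;
        Text(String source) { this.source = source; }
    }

    interface ConcreteSyntax extends Representation {}      // 2. parse tree, 1-1 with the text
    interface Kore extends Representation {}                // 3. flattened AST
    interface CompiledKore extends Representation {}        // 4. fully compiled, executable form

    // Each component of the pipeline converts one representation into the next.
    interface Stage<A extends Representation, B extends Representation> {
        B apply(A input);
    }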

Text -> Concrete Syntax

The first component of the system is the parser. The parser converts from the first representation, the text to be parsed, into the second, the concrete syntax of that text, which corresponds as closely as possible to the original text. This component comes in two forms. The first, over definitions, takes a file, slurps up all included files, and also generates a set of parsing tables and metadata used to parse other text with the same tables. The second takes such a set of parsing tables and uses them to parse a particular string.

The output of this component is concrete syntax. This concrete syntax has brackets, casts, literate comments, and references to the productions used to parse each subterm. A mixfix traversal of this syntax should yield text nearly identical to the original text that was parsed. One particular optimization that may be desirable in the future is to annotate this concrete syntax with the layout information provided in whitespace and comments, which would allow an exact 1-1 correspondence between the text and the parse tree.
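
A rough sketch of the two forms of this component, assuming hypothetical DefinitionParser, TermParser, and ParseTables types (none of these names are taken from the existing code):

    import java.io.File;

    interface ConcreteSyntax {}              // parse tree with brackets, casts, comments, productions
    interface ParseTables {}                 // parsing tables plus metadata generated from a definition

    // Form 1: parse a definition, slurping up all included files, and also
    // generate the tables needed to parse other text with the same grammar.
    interface DefinitionParser {
        ParsedDefinition parseDefinition(File mainFile);
    }

    // Form 2: parse a particular string against a previously generated set of tables.
    interface TermParser {
        ConcreteSyntax parseString(ParseTables tables, String text, String startSort);
    }

    final class ParsedDefinition {
        final ConcreteSyntax syntax;
        final ParseTables tables;
        ParsedDefinition(ConcreteSyntax syntax, ParseTables tables) {
            this.syntax = syntax;
            this.tables = tables;
        }
    }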

Concrete Syntax -> KORE

The second component of the system is the flattener. This relatively small component converts the concrete syntax output by the parser into the KORE syntax, which is used by developers who want an AST instead of a parse tree. A future optimization may be to merge this format with the KORE format, using attributes to annotate all the appropriate layout information; to start with, however, we will keep the two formats separate simply to facilitate the refactoring process. The flattener removes casts, brackets, and comments, and outputs a KORE representation of the definition, which should contain all the information needed to perform arbitrary computations over the term. In particular, the only information lost by the flattener should be the specific textual representation of the original definition. Some information about the specific representation of the outer syntax may also be lost, because KORE simplifies the syntax of definitions slightly (e.g. by removing the | grouping of productions).
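
The core of the flattener can be pictured as a recursive traversal that drops the purely textual constructs. A minimal sketch, assuming hypothetical Bracket and Cast node classes for the concrete syntax:

    interface SyntaxNode {}                                  // concrete-syntax node
    interface KoreTerm {}                                    // flattened KORE term

    final class Bracket implements SyntaxNode {              // (t) -- grouping only
        final SyntaxNode child;
        Bracket(SyntaxNode child) { this.child = child; }
    }

    final class Cast implements SyntaxNode {                 // t :: Sort -- guides parsing only
        final SyntaxNode child;
        final String sort;
        Cast(SyntaxNode child, String sort) { this.child = child; this.sort = sort; }
    }

    class Flattener {
        KoreTerm flatten(SyntaxNode node) {
            if (node instanceof Bracket) {
                return flatten(((Bracket) node).child);      // brackets carry no meaning in KORE
            }
            if (node instanceof Cast) {
                return flatten(((Cast) node).child);         // casts could become sort attributes instead
            }
            // Comments are dropped; a production application becomes an application of
            // the production's label over the flattened children. Omitted in this sketch.
            throw new UnsupportedOperationException("sketch only");
        }
    }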

KORE -> Compiled KORE

The third component of the system is the compiler. This is the component which performs transformations over the input and generates all the data needed for krun to be able to execute programs. The output of this stage is purely binary and has no textual representation; in particular, it may contain any extra data a backend needs in order to execute the definition.

Grigore: I do NOT like this at all. This binary format will only give developers an excuse to throw all kinds of garbage into that binary, as happens now with the Context object. Why the urge to have such a binary format NOW? I think it is a strategic mistake at this stage (just like over-optimizing a system too early in the process). We should instead spit out textual KORE, using the KORE pretty-printer. KORE is so simple that its parser should be extremely fast. I know you can make an argument that no matter how fast it is, the binary format will be even faster, but that kind of ms-level speed loss is far from our biggest problem. Let us wait until that becomes a problem and then we will switch to a binary format to be passed between kompile and krun; when we get to the point where our biggest problem is the few ms of performance loss at krun's startup, I will already be the happiest person on earth :-). So, again, here is my position on this: all the various stages of the kompiler should take KORE to KORE (the same(!) KORE), and at any point, including at the end, we should be able to pretty-print the current version of the kompiled definition using the unique KORE pretty-printer. This will give our users a simple mental model, our developers a simple approach to debugging, and us, the designers, a simple way to explain and improve our overall design. Just like the overall design of LLVM, where the various compilation stages and even tools are nothing but translators from LLVM IR to LLVM IR.

Dwight: The problem with this is that we will essentially always have to compile anything that was output by a compiler pass that occurred incrementally. For one thing, we will never be able to know by examining the KORE whether a particular compiler pass has been run, so it will have to be run again. Also, the information needed to execute in a particular backend will simply never be present in KORE itself, not unless we choose to significantly increase the overhead of booting krun, which I think you can agree is a mistake that Maude made and that we do not want to replicate. There will be specific portions of the representation of terms in the executable format which are simply not present in KORE, such as the representation of Maps as data structures, the representation of builtins as native data types, the computation of tables needed by the Java rewrite engine, etc. I agree that there should be a place in the architecture for compiler passes which transform KORE to KORE; I simply think that after such passes are run there will have to be other passes as well.
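
One way to picture the compromise hinted at in the last sentence above is a pipeline in which most passes are KORE-to-KORE, and therefore pretty-printable at every step, followed by a final backend-specific lowering that computes the extra executable data. This is only a sketch of that idea; none of the names below are taken from the codebase.

    import java.util.List;

    interface KoreDefinition {}                              // plain KORE, printable at any point
    interface CompiledDefinition {}                          // backend-specific executable form

    interface KorePass {
        KoreDefinition apply(KoreDefinition d);              // KORE in, KORE out
    }

    interface BackendLowering {
        CompiledDefinition lower(KoreDefinition d);          // computed once, at the very end
    }

    final class Kompile {
        CompiledDefinition run(KoreDefinition d, List<KorePass> passes, BackendLowering backend) {
            for (KorePass pass : passes) {
                d = pass.apply(d);                           // each intermediate result is still KORE
            }
            return backend.lower(d);
        }
    }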

Input Compiled KORE -> Output Compiled KORE

The fourth component of the system is the executor, or krun backend. This takes the input configuration, which is constructed by the krun frontend, and performs whatever analysis is specified by krun (eg execution, search, model checking, etc). The input configuration can be specified as text, which will be parsed, flattened, and compiled, or as KORE, which will be compiled directly. The output configuration will be stored as compiled KORE in order to allow it to be executed further, but the primary API that provides it as output to users will do so in KORE format.
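
A sketch of the two entry points for the input configuration described above; every name here is an assumption rather than the real krun API:

    final class KrunInput {
        interface KoreTerm {}
        interface CompiledTerm {}

        // Textual input is parsed, flattened, and then compiled.
        CompiledTerm fromText(String text) {
            return compile(flatten(parse(text)));
        }

        // KORE input skips parsing and flattening and is compiled directly.
        CompiledTerm fromKore(KoreTerm term) {
            return compile(term);
        }

        // Placeholders standing in for the parser, flattener, and compiler described earlier.
        private Object parse(String text) { throw new UnsupportedOperationException("sketch"); }
        private KoreTerm flatten(Object concreteSyntax) { throw new UnsupportedOperationException("sketch"); }
        private CompiledTerm compile(KoreTerm term) { throw new UnsupportedOperationException("sketch"); }
    }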

Compiled KORE -> KORE

This component undoes the transformations of the compiler in order to output data which can be treated as KORE (ie reversing any optimizations in the representation of the definition or term). The output of this should be in exactly the same format as the output of the flattener, which is also the same as the output of the kast tool.

KORE -> Concrete Syntax

This is the unparser. The unparser has the task of attempting to determine the optimal textual representation of the KORE term by converting labels back to productions, adding brackets and casts, etc. The output of this phase should be convertible to text using a simple mixfix traversal.

Concrete Syntax -> Text

This is the visitor which does the aforementioned mixfix traversal.
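
The mixfix traversal itself is straightforward: each node carries the production it was parsed with, and the traversal interleaves the production's terminals with the unparsed children. A minimal sketch with hypothetical node and production representations:

    import java.util.List;

    // A concrete-syntax node: the production it was parsed with, plus its children.
    final class Node {
        final List<String> items;     // production items: terminals, or "_" for a non-terminal slot
        final List<Node> children;    // one child per non-terminal slot, in order
        Node(List<String> items, List<Node> children) { this.items = items; this.children = children; }
    }

    final class MixfixUnparser {
        String unparse(Node node) {
            StringBuilder sb = new StringBuilder();
            int child = 0;
            for (String item : node.items) {
                if (item.equals("_")) {
                    sb.append(unparse(node.children.get(child++)));   // recurse into the subterm
                } else {
                    sb.append(item);                                  // emit the terminal as-is
                }
                sb.append(' ');                                       // naive spacing
            }
            return sb.toString().trim();
        }
    }

For example, a node for a production "if _ then _ else _" with three children unparses to "if a then b else c" (with naive spacing; the real unparser would handle whitespace and layout more carefully).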

Design of the new data structures

The data structures needed in order to implement these three internal representations will not be dealt with in detail in this document. The primary constraints that they must obey are as follows (a brief sketch in code follows the list):

  • Must have visitors over each representation (see details below)
  • Must use immutable data structure types for lists/maps/etc
  • Must be immutable and have immutable typed attributes
  • Must be the only source of information about the definition (ie no separate Context object)
  • Should not in any way affect the build system of developers who do not wish to develop the data structure classes
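
A small sketch of what these constraints imply for the shape of the classes, using Guava's immutable collections purely as an example; the actual collection library, class names, and attribute types are still to be decided:

    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableMap;

    // Every node is immutable: final fields, immutable collections, no setters,
    // and no separate Context object holding extra information about the definition.
    final class Module {
        private final String name;
        private final ImmutableList<Sentence> sentences;
        private final ImmutableMap<String, String> attributes;   // typed attributes in the real design

        Module(String name, ImmutableList<Sentence> sentences, ImmutableMap<String, String> attributes) {
            this.name = name;
            this.sentences = sentences;
            this.attributes = attributes;
        }

        String name() { return name; }
        ImmutableList<Sentence> sentences() { return sentences; }
        ImmutableMap<String, String> attributes() { return attributes; }

        // "Mutation" produces a new Module rather than modifying this one.
        Module addSentence(Sentence s) {
            ImmutableList<Sentence> updated =
                ImmutableList.<Sentence>builder().addAll(sentences).add(s).build();
            return new Module(name, updated, attributes);
        }
    }

    interface Sentence {}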

Design of the new visitors

Currently, visitors in the implementation take a Context object, which is used for a variety of purposes, including constructing new terms. This will be replaced with a mutable visitor state containing information about the path to the node being visited. In particular, when visiting a module, we will have access to the definition. When visiting a sentence, we will have access to the module. When visiting child nodes of a sentence, we will have access to the sentence.

The visitor implementation will have the task of tracking this information as it traverses the tree, as well as allowing the state to be supplied manually through a constructor in order to visit child nodes without visiting the entire definition, module, or sentence. The information that is in the Context object right now will be replaced with methods and attributes on the Definition and Module objects that update incrementally as the definition is transformed.

In particular, it is the responsibility of the transformer to provide an API for deleting and adding sentences; this API will internally update whatever representations are needed to keep this information up to date. The transformer will call these methods internally when various values are returned from the visit methods.
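
A sketch of the visitor state described above: the visitor tracks the definition, module, and sentence on the path to the current node, and the same state can be seeded through a constructor when visiting a subtree in isolation. All names are illustrative.

    abstract class AbstractKoreVisitor {
        // Mutable visitor state: the path from the root to the node being visited.
        protected Definition currentDefinition;
        protected Module currentModule;
        protected Sentence currentSentence;

        AbstractKoreVisitor() {}

        // Seed the context manually to visit a subtree without visiting the whole definition.
        AbstractKoreVisitor(Definition definition, Module module, Sentence sentence) {
            this.currentDefinition = definition;
            this.currentModule = module;
            this.currentSentence = sentence;
        }

        void visit(Definition definition) {
            currentDefinition = definition;
            for (Module module : definition.modules()) {
                visit(module);
            }
        }

        void visit(Module module) {
            currentModule = module;               // children of this module can now see it
            for (Sentence sentence : module.sentences()) {
                visit(sentence);
            }
        }

        void visit(Sentence sentence) {
            currentSentence = sentence;           // terms below this sentence can now see it
        }
    }

    interface Definition { Iterable<Module> modules(); }
    interface Module { Iterable<Sentence> sentences(); }
    interface Sentence {}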

Migration path

Part 1 (Frontend)

The purpose of this path is to unblock Radu's work on the frontend by changing the components of the current implementation which are blocking modular parsing. @radumereuta and @osa1 will work with me to come up with a specific implementation for the relevant features of the design.

  1. Add fields and constructors to the visitors to support constructing a visitor in a particular definition, module, or sentence context, and to support accessing the current context of the visitor from a visit method (which updates as it traverses)
  2. Remove the Context field from AbstractVisitor and replace it with ModuleContext and DefinitionContext fields on the Module and Definition classes (see the sketch after this list). Compute this information statically at parse time on a per-module basis, combining each module's information incrementally. Users access module-specific data at the definition level by accessing the relevant information in the definition's main module.
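
A sketch of what step 2 could look like, with the context data living on the Module and Definition classes themselves rather than on the visitor (illustrative names only):

    // Per-module parsing information, computed statically at parse time.
    class ModuleContext {
        // e.g. declared sorts, productions, and priorities for this module and its imports

        // Combine this module's information with that of an imported module.
        ModuleContext combine(ModuleContext imported) {
            return new ModuleContext();   // merging of the actual data omitted in this sketch
        }
    }

    class DefinitionContext {
        // definition-wide information, e.g. the table of all modules by name
    }

    class Module {
        private final ModuleContext context;
        Module(ModuleContext context) { this.context = context; }
        ModuleContext context() { return context; }
    }

    class Definition {
        private final DefinitionContext context;
        private final Module mainModule;
        Definition(DefinitionContext context, Module mainModule) {
            this.context = context;
            this.mainModule = mainModule;
        }
        // Module-specific data at the definition level is reached through the main module.
        ModuleContext mainModuleContext() { return mainModule.context(); }
    }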

Part 2 (Data structures and visitors)

This path begins in parallel with the frontend path; after the first path is complete, the two combine, with @osa1 helping @cos to complete the rest of it in the best way possible, with whatever parallelism may be available.

  1. Create and merge the new data structures.
  2. Create and merge the visitors for the new data structures in their final form (except for incremental computation of data as the definition changes)
  3. Create converters from the old KIL to the new KORE and from the new KORE to the old KIL (see the sketch after this list). This will require determining the form the context information should take in the new KORE, ie determining what information should become attributes and what information should become incrementally updated indices on the data structure itself.
  4. Attempt to create and merge a hybrid implementation which puts these two transformations at either the beginning or end of the compilation pipeline (to be determined later) and passes all existing tests.
  5. Gradually step the portion using the new KORE outward until it covers the entire compilation pipeline.
  6. Create a converter from the new KORE to the java backend KIL, and from the old frontend representation to the new KORE.
  7. Delete all parts of the old KIL not used by the frontend.
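
The converters of step 3 and the hybrid pipeline of step 4 can be pictured roughly as follows; the interface names are assumptions, not the actual classes:

    interface KilDefinition {}     // old KIL representation
    interface KoreDefinition {}    // new KORE representation

    // Old KIL -> new KORE, deciding what becomes attributes and what becomes indices.
    interface KilToKore {
        KoreDefinition convert(KilDefinition d);
    }

    // New KORE -> old KIL, so the unconverted parts of the pipeline keep working.
    interface KoreToKil {
        KilDefinition convert(KoreDefinition d);
    }

    // Step 4's hybrid: the migrated portion of the pipeline runs over the new KORE,
    // with conversions at its boundary so the remaining old passes still work.
    final class HybridPipeline {
        KilDefinition throughNewPortion(KilDefinition d, KilToKore in, KoreToKil out) {
            KoreDefinition kore = in.convert(d);
            // ... the migrated KORE-based compiler passes run here ...
            return out.convert(kore);
        }
    }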

Part 3 (Frontend again)

  1. Create a new representation for the frontend concrete syntax, including visitors and a transformation from the new concrete syntax to the new KORE. This representation should share the same outer syntax as the new KORE.
  2. Create a hybrid implementation which puts transformations from the old concrete syntax to the new concrete syntax and from the new concrete syntax to the new KORE at the end of the parser pipeline.
  3. Gradually step the portion using the new concrete syntax outward until it covers the entire parser pipeline.