
Design of the new compiler parser


Please also consult https://github.com/kframework/k/wiki/KAST-and-KORE if you are interested in more details about KORE and how to translate K to KORE. The algorithm used by the new parser is described in https://github.com/kframework/k/wiki/parsing-algorithm.pdf

Since the design is a little complex, I (@dwightguth) decided to write up a wiki page for the design of the implementation of the compiler and parser once we have completed the current set of refactorings. I would like in particular @radumereuta, @cos, and @osa1 to review this design and let me know if it is satisfactory to them. I also include a section at the end describing two parallel paths of work on the implementation, which together begin the migration path. I would like @osa1 and @radumereuta to work on the first of these paths, which begins the work needed to move forward with the new frontend, and @cos to work on the second, which begins the work needed to have a compiler over the new data structures.

Representations and Components

We can imagine the pipeline of the K framework as a set of transformations between four distinct representations of definitions and terms, and the components which link these representations together.

  1. Text: This is the textual representation of concrete syntax which is parsed by the parser.
  2. Concrete Syntax: This is the parse tree of the textual representation, which corresponds as closely as possible to the original text that was parsed.
  3. Textual KORE: This is the textual representation of the KORE format. It is designed for fast unambiguous parsing and pretty printing.
  4. KORE: This is the AST of what was parsed: a conceptual representation of the definition with the details of its specific textual representation flattened away. It maintains internal representations of builtin data types to ease their use in the compiler.

Text -> Concrete Syntax

The first component of the system is the parser. The parser converts the first representation, the text to be parsed, into the second representation, the concrete syntax of that text, which corresponds as closely as possible to the original text. This component comes in two forms. The first, over definitions, takes a file and slurps up all included files, and also generates a set of parsing tables and metadata used to parse other things with the same tables. The second takes such a set of parsing tables and uses them to parse a particular string.

The output of this component is concrete syntax. This concrete syntax has brackets, casts, literate comments, and references to the productions used to parse each subterm. A mixfix traversal of this syntax should yield text nearly identical to the original text that was parsed. One particular optimization that may be desirable in the future is to annotate this concrete syntax with the layout information provided in whitespace and comments, which would allow an exact 1-1 correspondence between the text and the parse tree.
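
To make this concrete, here is a minimal sketch of what such a concrete-syntax representation might look like; all class names (ProductionReference, TermCons, and so on) are illustrative, not the actual classes:

```java
import java.util.List;

// Illustrative: a reference to the grammar production used to parse a node.
final class ProductionReference {
    final String klabel;      // label the flattener will later apply
    final List<String> items; // mixfix items; "_" marks a nonterminal in this sketch
    ProductionReference(String klabel, List<String> items) {
        this.klabel = klabel;
        this.items = items;
    }
}

interface ConcreteTerm {}

// A term together with the production that parsed it, so a mixfix
// traversal can reproduce (nearly) the original text.
final class TermCons implements ConcreteTerm {
    final ProductionReference production;
    final List<ConcreteTerm> children; // one child per nonterminal
    TermCons(ProductionReference production, List<ConcreteTerm> children) {
        this.production = production;
        this.children = children;
    }
}

// Brackets and casts are kept explicitly so the parse tree round-trips.
final class Bracket implements ConcreteTerm {
    final ConcreteTerm content;
    Bracket(ConcreteTerm content) { this.content = content; }
}

final class Cast implements ConcreteTerm {
    final ConcreteTerm content;
    final String sort; // the sort annotation written in the source text
    Cast(ConcreteTerm content, String sort) {
        this.content = content;
        this.sort = sort;
    }
}
```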

Concrete Syntax -> KORE

The second component of the system is the flattener. This relatively small component converts the concrete syntax output by the parser into the KORE syntax, which is used by developers who want an AST instead of a parse tree. It removes casts, brackets, and comments, and outputs a KORE representation of the definition, which should contain all the information needed in order to perform arbitrary computations over the term. In particular, the only information lost by the flattener should be the specific textual representation of the original definition. There might also be a loss of information concerning the specific representation of information in the outer syntax, because KORE simplifies the syntax of definitions slightly (e.g. by removing the | grouping of productions). A future optimization may be to merge this format with the KORE format, using attributes to annotate all the appropriate layout information; to start with, however, we will keep these components separate in order to simplify the refactoring process.
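
As an illustration, a flattening pass over the concrete-syntax sketch above might look like the following; KTerm and KApply are again assumed names for the KORE side, not the actual API:

```java
import java.util.ArrayList;
import java.util.List;

interface KTerm {}

// Illustrative KORE application node: a klabel applied to children.
final class KApply implements KTerm {
    final String klabel;
    final List<KTerm> children;
    KApply(String klabel, List<KTerm> children) {
        this.klabel = klabel;
        this.children = children;
    }
}

final class Flattener {
    // Brackets and casts carry no semantic content, so they are dropped;
    // each TermCons becomes an application of its production's klabel.
    KTerm flatten(ConcreteTerm t) {
        if (t instanceof Bracket) return flatten(((Bracket) t).content);
        if (t instanceof Cast) return flatten(((Cast) t).content);
        TermCons tc = (TermCons) t;
        List<KTerm> children = new ArrayList<>();
        for (ConcreteTerm child : tc.children) children.add(flatten(child));
        return new KApply(tc.production.klabel, children);
    }
}
```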

KORE -> KORE

The third component of the system is the compiler. This is the component which performs transformations over the input and generates all the data needed for krun to be able to execute programs. The output of this stage is in the same format as the input, but potentially transformed and annotated with information to allow it to be more efficiently executable. It should be possible for developers to serialize the definition immediately before passing it to the compiler backend, load it back, and obtain the exact same set of information.
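
The round-trip requirement can be stated as a small check; this sketch assumes a hypothetical Definition class that is Serializable and has value equality, and leaves the concrete serialization mechanism open:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class RoundTripCheck {
    // Serialize a definition to bytes and load it back.
    static <D extends Serializable> D reload(D definition) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(definition);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            D reloaded = (D) in.readObject();
            return reloaded;
        }
    }

    // Serializing just before the backend and loading it back must yield
    // exactly the same set of information.
    static <D extends Serializable> boolean roundTrips(D definition) throws Exception {
        return reload(definition).equals(definition);
    }
}
```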

KORE -> Backend representation

The fourth component is the compiler backend. This component takes the transformed KORE and performs backend-specific postprocessing on it, outputting an internal representation in whatever format is needed to execute. For example, the Maude backend outputs Maude files, and the Java backend outputs a serialized representation of the definition.

The code of this component is also shared with the next component, which uses it to convert input configurations and programs to the backend representation.

Input KORE -> Output KORE

The fifth component of the system is the executor, or krun backend. This takes the input configuration, which is constructed by the krun frontend, and performs whatever analysis is specified by krun (e.g. execution, search, model checking). The input configuration can be specified as text, which will be parsed, flattened, and compiled, or as KORE, which will be compiled directly. The output configuration should be able to be passed back to the system to be executed further, as well as serialized to text in the KORE format. The backend is free to convert internally to its own representation of the input, and then from its own representation of the output back to KORE.
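
The contract might be pictured roughly as follows; the names Executor, Analysis, and KoreTerm are assumptions for illustration, not the actual krun API:

```java
interface KoreTerm {}

enum Analysis { EXECUTION, SEARCH, MODEL_CHECKING }

interface Executor {
    // Input and output are both KORE configurations, so the output can be
    // fed back in for further execution or unparsed to textual KORE.
    KoreTerm run(KoreTerm inputConfiguration, Analysis analysis);
}
```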

KORE -> Concrete Syntax

This is the unparser. The unparser has the task of attempting to determine the optimal textual representation of a KORE term by mapping it back to productions, adding brackets and casts, etc. The output of this phase should be convertible to text using a simple mixfix traversal.
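
One piece of this, the bracket-insertion decision, might look roughly like the following sketch over the concrete-syntax classes above; the priority comparison is grammar-specific and left abstract here:

```java
final class Bracketer {
    // Wrap a child in a bracket when its production binds more loosely
    // than the parent position allows.
    ConcreteTerm bracketIfNeeded(TermCons parent, int position, ConcreteTerm child) {
        if (child instanceof TermCons
                && !allowedAt(parent.production, position, ((TermCons) child).production)) {
            return new Bracket(child);
        }
        return child;
    }

    // True when the child production may appear unbracketed at this
    // position of the parent, per the grammar's priority and
    // associativity declarations (left abstract in this sketch).
    boolean allowedAt(ProductionReference parent, int position, ProductionReference child) {
        throw new UnsupportedOperationException("grammar-specific");
    }
}
```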

Concrete Syntax -> Text

This is the visitor which performs the aforementioned mixfix traversal.
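
A sketch of such a traversal over the hypothetical concrete-syntax classes above: terminals of the production are emitted verbatim, and nonterminal positions (marked "_" in the sketch) are filled with the unparsed children:

```java
final class MixfixUnparser {
    String unparse(ConcreteTerm t) {
        if (t instanceof Bracket) return "(" + unparse(((Bracket) t).content) + ")";
        if (t instanceof Cast) {
            Cast c = (Cast) t;
            return unparse(c.content) + ":" + c.sort; // simplified cast syntax
        }
        TermCons tc = (TermCons) t;
        StringBuilder sb = new StringBuilder();
        int child = 0;
        for (String item : tc.production.items) {
            // "_" marks a nonterminal position in this sketch's convention.
            if (item.equals("_")) sb.append(unparse(tc.children.get(child++)));
            else sb.append(item);
            sb.append(' ');
        }
        return sb.toString().trim();
    }
}
```

With the layout annotation mentioned earlier, the spaces inserted here would instead be recovered from the recorded whitespace and comments.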

Design of the new data structures

The data structures needed in order to implement these three internal representations will not be dealt with in detail in this document. The primary constraints that they must obey are as follows (a minimal sketch follows the list):

  • Must have visitors over each representation (see details below)
  • Must use immutable data structure types for lists/maps/etc
  • Must be immutable and have immutable typed attributes
  • Must be the only source of information about the definition (i.e. no separate Context object)
  • Should not affect the build system of developers who do not wish to develop the data structure classes in any way
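
Here is a minimal sketch of a data structure obeying these constraints, with assumed names throughout: final fields, unmodifiable collections, typed attribute keys, and no separate Context object:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Typed key for an immutable attribute; get(AttKey<T>) yields a T.
final class AttKey<T> {
    final String name;
    AttKey(String name) { this.name = name; }
}

final class Module {
    private final String name;
    private final Set<Sentence> sentences;
    private final Map<AttKey<?>, Object> attributes;

    Module(String name, Set<Sentence> sentences, Map<AttKey<?>, Object> attributes) {
        this.name = name;
        this.sentences = Collections.unmodifiableSet(new HashSet<>(sentences));
        this.attributes = Collections.unmodifiableMap(attributes);
    }

    String name() { return name; }
    Set<Sentence> sentences() { return sentences; }

    @SuppressWarnings("unchecked")
    <T> T att(AttKey<T> key) { return (T) attributes.get(key); }

    // "Mutation" produces a new Module; the original is never changed.
    Module addSentence(Sentence s) {
        Set<Sentence> larger = new HashSet<>(sentences);
        larger.add(s);
        return new Module(name, larger, attributes);
    }
}

interface Sentence {}
```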

Design of the new visitors

Currently in the implementation, visitors take a Context object which is used for a variety of purposes, including the construction of new terms. This will be replaced with a mutable visitor state containing information about the path to the node being visited. In particular, when visiting a module, we will have access to the definition. When visiting a sentence, we will have access to the module. When visiting child nodes of a sentence, we will have access to the sentence.

The visitor implementation will have the task of tracking this information as it traverses the tree, and will also allow this state to be supplied manually at construction time, in order to visit child nodes without visiting the entire definition, module, or sentence. The information that is in the Context object right now will be replaced with methods and attributes on the Definition and Module objects which update incrementally as the definition is transformed.

In particular, it is the responsibility of the transformer to provide an API for deleting and adding sentences; this API will internally update whatever representations are needed in order to keep this information up to date. The transformer will internally call these methods when various values are returned from the visit methods.
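
Putting these pieces together, the visitor state might look roughly like this, building on the Module and Sentence sketches above; all names are illustrative:

```java
import java.util.List;

abstract class AbstractKoreVisitor {
    // Path to the node being visited, maintained during traversal.
    protected Definition currentDefinition;
    protected Module currentModule;
    protected Sentence currentSentence;

    AbstractKoreVisitor() {}

    // Start mid-tree: visit a lone sentence's children with the enclosing
    // definition and module supplied manually.
    AbstractKoreVisitor(Definition d, Module m) {
        currentDefinition = d;
        currentModule = m;
    }

    void visit(Definition d) {
        currentDefinition = d;
        for (Module m : d.modules()) visit(m);
        currentDefinition = null;
    }

    void visit(Module m) {
        currentModule = m;
        for (Sentence s : m.sentences()) visit(s);
        currentModule = null;
    }

    void visit(Sentence s) {
        currentSentence = s;
        // subclasses visit the sentence's children here, with the full
        // definition/module/sentence path available
        currentSentence = null;
    }
}

interface Definition { List<Module> modules(); }
```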

Migration path

Part 1 (Front-end)

The purpose of this path is to unblock Radu's work on the frontend by changing the components of the current implementation which are blocking modular parsing. @radumereuta and @osa1 will work with me to come up with a specific implementation for the relevant features of the design.

  1. Add fields and constructors to the visitors to support constructing a visitor in a particular definition, module, or sentence context, and to support accessing the current context of the visitor from a visit method, updating as the visitor traverses (done: #1100)
  2. Remove the Context field from AbstractVisitor and replace it with ModuleContext and DefinitionContext fields on the Module and Definition classes. Compute this information statically at parse time on a per-module basis, combining each module incrementally. Users access module-specific data at the definition level by accessing the relevant information in the definition's main module (see the sketch below).
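
A rough sketch of the per-module context from step 2; ModuleContext here is an assumed shape, tracking only declared sorts for illustration:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class ModuleContext {
    private final Set<String> declaredSorts = new HashSet<>();

    // Fold an imported module's context into this one, so definition-level
    // queries reduce to a lookup on the main module's context.
    void addFrom(ModuleContext imported) {
        declaredSorts.addAll(imported.declaredSorts);
    }

    void declareSort(String sort) { declaredSorts.add(sort); }

    Set<String> declaredSorts() { return Collections.unmodifiableSet(declaredSorts); }
}
```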

Part 2 (data structures and visitors)

This path begins in parallel with the frontend path; after the first path is complete, the two combine, with @osa1 helping @cos to complete the rest of it in the optimal way, with whatever parallelism may be possible.

  1. Create and merge the new data structures.
  2. Create and merge the visitors for the new data structures in their final form (except for incremental computation of data as the definition changes)
  3. Migrate the Java backend macro expansion to happen over KIL in the compiler phase.
  4. Create converters from the old KIL to the new KORE and from the new KORE to the old KIL. This will require determining the form the context information should take in the new KORE, i.e., determining what information should become attributes, and what information should become incrementally updated indices on the data structure itself.
  5. Attempt to create and merge a hybrid implementation which puts these two transformations at either the beginning or end of the compilation pipeline (to be determined later) and passes all existing tests. When the pipeline reaches the end of compilation, change the code that initializes the configuration to take it from the macro-expanded initializer.
  6. Gradually step the portion using the new KORE outward until it covers the entire compilation pipeline.
  7. Create a converter from the new KORE to the Java backend KIL, and from the old frontend representation to the new KORE.
  8. Delete all parts of the old KIL not used by the frontend.

Part 3 (Frontend again)

  1. Create a new representation for the frontend concrete syntax, including visitors and a transformation from the new concrete syntax to the new KORE. This representation should share the same outer syntax as the new KORE.
  2. Create a hybrid implementation which puts transformations from the old concrete syntax to the new concrete syntax and from the new concrete syntax to the new KORE at the end of the parser pipeline.
  3. Gradually step the portion using the new concrete syntax outward until it covers the entire parser pipeline.

Breakdown of Context fields

To be moved to Definition and then modified in KORE:

  • Map<String, Cell> cells: to be replaced in KORE with index of cell productions by cell label
  • Map<String, Sort> cellSorts: to be removed in KORE and replaced with access to nonterminals of cell labels
  • Map<String, Sort> configVarSorts: to be replaced in KORE with automatically updated index
  • BiMap<String, Production> conses: to be removed in new parser
  • ConfigurationStructureMap configurationStructureMap: to be replaced in KORE with attributes on definition
  • int maxConfigurationLevel: to be replaced in KORE with attributes on definition

To be moved to Module and made to compute transitive closures, then modified in KORE (a sketch of one such automatically updated index follows this list):

  • Set<Production> productions: to be replaced in KORE with automatically updated index
  • SetMultiMap<String, Production> klabels: to be replaced in KORE with automatically updated index
  • Map<Sort, Production> listProductions: to be removed in KORE since lists become just a special case of other productions
  • SetMultimap<String, Production> listKLabels: to be removed in KORE since lists become just a special case of other productions
  • Map<Sort, Production> canonicalBracketForSort: to be replaced in KORE with automatically updated index
  • Poset<Sort> subsorts: to be replaced in KORE with automatically updated index
  • Poset<Sort> syntacticSubsorts: to be replaced in KORE with automatically updated index
  • Poset<String> priorities: to be replaced in KORE with automatically updated index
  • Poset<String> assocLeft: to be replaced in KORE with automatically updated index
  • Poset<String> assocRight: to be replaced in KORE with automatically updated index
  • HashMap<Sort, String> freshFunctionNames: to be replaced in KORE with attributes on the module.
  • Map<Sort, DataStructureSort> dataStructureSorts: to be replaced in KORE with assoc, comm, etc. attributes
  • Set<Sort> tokenSorts: to be replaced in KORE with automatically updated index
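
As an example of what "automatically updated index" could mean, here is a sketch of productions indexed by klabel, updated by the Module's add/remove API; for brevity the index is mutable internally, though in a fully immutable design it would instead be rebuilt as part of each functional update. All names are assumed:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ProductionIndex {
    private final Map<String, Set<Production>> byKLabel = new HashMap<>();

    // Called internally by the Module's addSentence/removeSentence API so
    // that the index never goes stale as the definition is transformed.
    void add(Production p) {
        byKLabel.computeIfAbsent(p.klabel(), k -> new HashSet<>()).add(p);
    }

    void remove(Production p) {
        Set<Production> set = byKLabel.get(p.klabel());
        if (set != null) {
            set.remove(p);
            if (set.isEmpty()) byKLabel.remove(p.klabel());
        }
    }

    Set<Production> lookup(String klabel) {
        return byKLabel.getOrDefault(klabel, Collections.<Production>emptySet());
    }
}

interface Production { String klabel(); }
```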

To be removed now

  • Sort startSymbolPgm: remove now and replace with configVarSorts
  • SetMultiMap<String, Production> tags: remove now and move computation to SDF parser.
  • Map<String, ASTNode> locations: remove now and move computation to Maude executor.
  • Set<Sort> definedSorts: remove now and replace with subsorts.elements
  • boolean initialized: remove now
  • Map<String, CellDataStructure> cellDataStructures: remove now and move data to attribute on Cell declaration
  • Set<Sort> variableTokenSorts: remove now along with related feature.
  • int numModules, numSentences, numProductions, numCells: move to KompileFrontEnd.

To be kept in Context for the time being

  • FileUtil
  • GlobalOptions
  • KompileOptions
  • KRunOptions
  • ColorOptions