Move stuff around into parser
roccojiang committed Jun 10, 2024
1 parent ecadac0 commit d025700
Showing 7 changed files with 163 additions and 65 deletions.
Binary file modified main.pdf
Binary file not shown.
11 changes: 6 additions & 5 deletions main.tex
@@ -256,11 +256,12 @@

\subfile{src/introduction/acknowledgements}%

\pagebreak
\tableofcontents
\pagebreak
\listoffigures
% \listoftables % TODO: uncomment if I have tables
% It's likely better to exclude these sections -- https://edstem.org/us/courses/46827/discussion/5031345
% \pagebreak
% \tableofcontents
% \pagebreak
% \listoffigures
% \listoftables

\pagebreak
\pagenumbering{arabic}
18 changes: 18 additions & 0 deletions src/bibliography.bib
@@ -765,3 +765,21 @@ @inbook{amin_essence_2016
doi = {10.1007/978-3-319-30936-1_14},
url = {https://doi.org/10.1007/978-3-319-30936-1_14}
}

@inproceedings{baars_leftrec_2004,
author = {Baars, Arthur I. and Swierstra, S. Doaitse},
title = {Type-safe, self inspecting code},
year = {2004},
isbn = {1581138504},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1017472.1017485},
doi = {10.1145/1017472.1017485},
abstract = {We present techniques for representing typed abstract syntax trees in the presence of observable recursive structures. The need for this arose from the desire to cope with left-recursion in combinator based parsers. The techniques employed can be used in a much wider setting however, since it enables the inspection and transformation of any program structure, which contains internal references. The hard part of the work is to perform such analyses and transformations in a setting in which the Haskell type checker is still able to statically check the correctness of the program representations, and hence the type correctness of the transformed program.},
booktitle = {Proceedings of the 2004 ACM SIGPLAN Workshop on Haskell},
pages = {69--79},
numpages = {11},
keywords = {compilers, domain specific languages, left-recursion, top-down parsing},
location = {Snowbird, Utah, USA},
series = {Haskell '04}
}
Binary file modified src/body/impl/parser.pdf
Binary file not shown.
60 changes: 0 additions & 60 deletions src/body/impl/parser.tex
@@ -17,71 +17,11 @@ \section{Representing and Simplifying Parsers}\label{sec:parser-representation}

% TODO: our parser representation is akin to Haskell parsley's deep-embedded combinator tree, albeit representing all combinators rather than just the core ones

For example, given two \textsc{ast} nodes \scala{Term.Name("p")} and \scala{Term.Name("q")} corresponding to named parsers \scala{p} and \scala{q}, suppose a transformation involves combining them with the \emph{ap} combinator \scala{<*>}.
One may consider using quasiquotes to achieve this: \scala{q"p <*> q"} would automatically expand to \scala{Term.ApplyInfix(Term.Name("p"), Term.Name("<*>"), Type.ArgClause(Nil), Term.ArgClause(List(Term.Name("q")), None))}.
However, this loses the static inspectability of the individual parsers \scala{p} and \scala{q} -- although quasiquotes can be used as extractor patterns to recover the original \textsc{ast} nodes, their usage as such is discouraged as they can easily result in unintended match errors. % TODO: cite? footnote? https://scalameta.org/docs/trees/guide.html#with-quasiquotes-1
The recommended approach is to pattern match on the \textsc{ast} nodes directly, which is obviously unergonomic even for this small example: to extract the \textsc{rhs} term \scala{q}, one would have to perform a nested pattern match on the \scala{Term.ApplyInfix} term and its \scala{Term.ArgClause} node representing the arguments of the infix function application.
This would clearly be a very painful process for the rule author.
It would be desirable to abstract away from the low-level syntactic \textsc{ast} representation, and instead treat these \textsc{ast} nodes as what they semantically represent -- parsers.
To this end, \cref{fig:parser-adt} shows how parser terms can be represented as an algebraic data type (\textsc{adt}), in the same way \texttt{parsley} itself uses a deep embedding to represent parsers as pure data objects.
The reasoning behind this approach is the same as that for \texttt{parsley} -- this representation allows parsers to be easily inspected and analysed via pattern matching on constructors.
\begin{figure}[htbp]
\begin{minted}{scala}
trait Parser
case class NonTerminal(ref: Symbol) extends Parser
case class Pure(f: Function) extends Parser
case object Empty extends Parser
case class Choice(p: Parser, q: Parser) extends Parser
case class Ap(p: Parser, q: Parser) extends Parser
...
\end{minted}
\caption{A subset of the core combinators in the \scala{Parser} \textsc{adt}.}
\label{fig:parser-adt}
\end{figure}
% Instead, represent parsers as an algebraic data type \textsc{adt} in the same way that Parsley itself uses a deep embedding to represent combinators as objects.
% Methods on these objects can then be used to manipulate them, and the resulting object can still be pattern matched, maintaining the static inspectability of the parsers.
% So then it's just like writing parsers in Parsley itself: \scala{p <*> q} constructs a \scala{Ap(p, q)} node which can still be pattern matched on.
% And similar to Parsley, representing everything as objects makes it easy to optimise using pattern matching on constructors.
% This representation also then gives us for free the implementation for lint rules such as \emph{Simplify Complex Parsers} rule, which applies parser laws to simplify parsers.
This \namecref{sec:parser-representation} explores the motivation behind this and the design choices made in the implementation.
The left-recursion factoring rule~(\cref{sec:factor-leftrec}) serves as the context for demonstrating the utility of this representation.

% TODO: fix the above "intro" ------------------------------------------------------------------------------

\paragraph{Running example}
The left-recursion factoring rule~(\cref{sec:factor-leftrec}) performs the most complex analyses and transformations on parsers in \texttt{parsley-garnish}.
Thus, it is a good example to motivate the design requirements for the parser representation.
The following left-recursive parser and its transformation into its \scala{postfix} form will serve as a running example for this \namecref{sec:parser-representation}:
\begin{minted}{scala}
lazy val expr: Parsley[String] = (expr, string("a")).zipped(_ + _) | string("b")
\end{minted}
\subsection{Detecting Named Parsers}
Before any analysis on parsers can be performed, it is first necessary to identify which \textsc{ast} nodes correspond to parsers.
\texttt{parsley-garnish} builds a map of all parsers defined within a source file, indexed by the unique symbol of its name.
Identifying these \textsc{ast} nodes of interest involves pattern matching on \scala{val}, \scala{var}, and \scala{def} definitions with a type inferred to be some \scala{Parsley[_]} -- this information is accessed by querying the Scalafix semantic \textsc{api} for the node's symbol information.
% In this example, the type of \scala{expr} is explicitly given as the Scala compiler requires this due to being a recursive definition.
Consider the labelled \textsc{ast} structure of the \scala{expr} parser:
\begin{minted}{scala}
Defn.Val(
mods = List(Mod.Lazy()),
pats = List(Pat.Var(Term.Name("expr"))),
decltpe = Some(
Type.Apply(Type.Name("Parsley"), Type.ArgClause(List(Type.Name("String"))))
),
rhs = Term.ApplyInfix(...)
)
\end{minted}
%
The qualified symbol \scala{expr} is used as the key in the map, and the \scala{rhs} term is lifted to the intermediate parser representation for analysis.
A reference to the original \textsc{ast} node is also kept so any lint diagnostics or code rewrites can be applied to the correct location in the source file.
Thus, a full traversal through the source file builds a map of all named parsers, representing all non-terminals in the grammar defined within that file.

\subsection{Converting Scalameta Terms to the Parser \textsc{adt}}
Having identified the \textsc{ast} nodes which represent parsers, they need to be transformed into the appropriate \scala{Parser} representation.
Binary file modified src/body/leftrec.pdf
Binary file not shown.
139 changes: 139 additions & 0 deletions src/body/leftrec.tex
@@ -4,4 +4,143 @@

\ourchapter{Removing Left-Recursion}\label{sec:factor-leftrec}

\section{Implementation}
\TODO{section intro}

\paragraph{Running example}
The following left-recursive parser and its transformation into its \scala{postfix} form will serve as a running example:
\begin{minted}{scala}
lazy val example: Parsley[String] = (example, string("a")).zipped(_ + _) | string("b")
\end{minted}
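To make the target of the transformation concrete, the following dependency-free sketch models the running example without \texttt{parsley}.
The left-recursive rule recognises \scala{string("b")} followed by any number of \scala{string("a")}, so a postfix-style parser can consume the base case first and then fold each trailing operator into the result.
The names \scala{PostfixSketch} and \scala{parsePostfix} are hypothetical and are not part of \texttt{parsley} or \texttt{parsley-garnish}.
\begin{minted}{scala}
// Hypothetical stand-alone model of the running example; this is not parsley's API.
// Grammar: example ::= example "a" | "b", which denotes the language b a*.
object PostfixSketch {
  // Parses the longest prefix of the form b a*.
  // Returns the combined result and the remaining input, or None on failure.
  def parsePostfix(input: String): Option[(String, String)] =
    if (!input.startsWith("b")) None // the base case string("b") must succeed first
    else {
      // Every following 'a' acts as one postfix operator.
      val (ops, rest) = input.drop(1).span(_ == 'a')
      // The left fold mirrors the left-nested (_ + _) of the original rule.
      Some((ops.foldLeft("b")((acc, a) => acc + a), rest))
    }
}
\end{minted}
No function here ever calls itself before consuming input, which is the property that the left-recursive definition lacks.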

\subsection{The Need for an Intermediate \textsc{ast}}
The transformations described by \textcite{baars_leftrec_2004} require an explicit representation of the grammar and production rules so that they can be inspected and manipulated before generating code.
They achieve this by representing parsers as a deep-embedded datatype in the form of an intermediate \textsc{ast}, in a similar manner to \texttt{parsley}.

Since \texttt{parsley-garnish} is a linter, it naturally has access to an explicit grammar representation in the form of the full Scala \textsc{ast} of the source program.
However, this \textsc{ast} is a general-purpose representation that becomes \TODO{hard to work with when trying to do domain-specific manipulations on grammars}.

Take for example the task of combining two \textsc{ast} nodes \scala{Term.Name("p")} and \scala{Term.Name("q")}, representing named parsers \scala{p} and \scala{q}, with the \emph{ap} combinator \scala{<*>}.
This operation can be concisely expressed with Scalameta quasiquotes, rather than manually writing out the full explicit \textsc{ast}:
\begin{minted}{scala}
q"p <*> q" ==
Term.ApplyInfix(
Term.Name("p"),
Term.Name("<*>"),
Type.ArgClause(Nil),
Term.ArgClause(List(Term.Name("q")), None)
)
\end{minted}
However, the reverse operation of inspecting the individual parsers \scala{p} and \scala{q} is not as straightforward.
Although quasiquotes can be used as extractor patterns in pattern matching, this usage is discouraged due to limitations in the quasiquote design that make it easy to accidentally introduce match errors\footnote{\url{https://scalameta.org/docs/trees/guide.html#with-quasiquotes-1}}.
Thus, extracting the parsers necessitates a long-winded pattern match like so:
\begin{minted}{scala}
val ap = SymbolMatcher.normalized("parsley.Parsley.`<*>`")
def deconstruct(parser: Term)(implicit doc: SemanticDocument) = parser match {
case Term.ApplyInfix(p, ap(_), _, Term.ArgClause(List(q), _)) => (p, q)
}
\end{minted}
Such a pattern match involves dealing with abstract general-purpose syntax constructs like \scala{Term.ApplyInfix} and \scala{Term.ArgClause}, which are low-level details not relevant to the task of manipulating parsers.
Although these details are not an issue for simple one-off transformations, more specialised transformations like left-recursion factoring would benefit from abstracting away from them.
The need therefore arises for a higher-level, intermediate \textsc{ast} representation that is more specialised to the domain of parser combinators.
\subsubsection{The Parser \textsc{adt}}
\texttt{parsley-garnish} uses a similar deep-embedded parser representation for the intermediate \textsc{ast} as \textcite{baars_leftrec_2004}, extended to match \texttt{parsley}'s combinators.
\Cref{fig:parser-adt} shows how this is implemented as an algebraic data type (\textsc{adt}), with extra syntactic sugar introduced by implementing \scala{unapply} methods in extractor objects.
\begin{figure}[htbp]
\begin{minted}{scala}
trait Parser
case class NonTerminal(ref: Symbol) extends Parser
case class Pure(f: Function) extends Parser
case object Empty extends Parser
case class Ap(p: Parser, q: Parser) extends Parser
object <*> {
def unapply(parser: Ap): Option[(Parser, Parser)] = Some((parser.p, parser.q))
}
case class Choice(p: Parser, q: Parser) extends Parser
object <|> {
def unapply(parser: Choice): Option[(Parser, Parser)] = Some((parser.p, parser.q))
}
\end{minted}
\caption{A subset of the core combinators in the \scala{Parser} \textsc{adt}.}
\label{fig:parser-adt}
\end{figure}
All \scala{Parser} types represent \texttt{parsley} combinators, with the exception of \scala{NonTerminal}, which represents a reference to a named parser.
Inspecting parsers is now easily done by pattern matching on constructors and/or using the extractor objects:
\begin{minted}{scala}
def deconstruct(parser: Parser) = parser match {
case Ap(p, q) => (p, q) // using constructor
case p <|> q => (p, q) // using extractor object
}
\end{minted}
% Instead, represent parsers as an algebraic data type \textsc{adt} in the same way that Parsley itself uses a deep embedding to represent combinators as objects.
% Methods on these objects can then be used to manipulate them, and the resulting object can still be pattern matched, maintaining the static inspectability of the parsers.
% So then it's just like writing parsers in Parsley itself: \scala{p <*> q} constructs a \scala{Ap(p, q)} node which can still be pattern matched on.
% And similar to Parsley, representing everything as objects makes it easy to optimise using pattern matching on constructors.
% This representation also then gives us for free the implementation for lint rules such as \emph{Simplify Complex Parsers} rule, which applies parser laws to simplify parsers.
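As a sketch of how this representation supports both construction and inspection, the following self-contained example adds infix methods to a simplified copy of the \scala{Parser} \textsc{adt}, so that \scala{p <*> q} builds an \scala{Ap(p, q)} node that can still be pattern matched.
It makes several assumptions: \scala{String} stands in for Scalameta's \scala{Symbol}, \scala{Pure} is omitted, and \scala{simplify} is a hypothetical helper that applies only the identity laws of \scala{<|>}.
None of this is necessarily how \texttt{parsley-garnish} implements it.
\begin{minted}{scala}
// Illustrative sketch only: a cut-down Parser ADT with hypothetical helpers.
object ParserAdtSketch {
  sealed trait Parser {
    // Infix constructors: building a parser just builds a data structure.
    def <*>(q: Parser): Parser = Ap(this, q)
    def <|>(q: Parser): Parser = Choice(this, q)
  }
  final case class NonTerminal(ref: String) extends Parser // String stands in for Symbol
  case object Empty extends Parser
  final case class Ap(p: Parser, q: Parser) extends Parser
  final case class Choice(p: Parser, q: Parser) extends Parser

  // Extractor object enabling the infix pattern `p <|> q`.
  object <|> {
    def unapply(parser: Choice): Option[(Parser, Parser)] = Some((parser.p, parser.q))
  }

  // Hypothetical simplifier using the laws  empty <|> p = p  and  p <|> empty = p.
  def simplify(parser: Parser): Parser = parser match {
    case p <|> q =>
      (simplify(p), simplify(q)) match {
        case (Empty, r) => r
        case (l, Empty) => l
        case (l, r)     => Choice(l, r)
      }
    case Ap(p, q) => Ap(simplify(p), simplify(q))
    case other    => other
  }
}
\end{minted}
With these definitions, \scala{simplify((Empty <|> p) <*> (q <|> Empty))} yields \scala{Ap(p, q)} for any non-terminals \scala{p} and \scala{q}.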
\subsection{Lifting Scalameta Terms to the Intermediate Parser \textsc{ast}}
Converting the raw Scala \textsc{ast} to the intermediate \textsc{ast} therefore requires the following basic operations:
\begin{enumerate}
\item Identifying all named parsers defined in the source program -- these correspond to non-terminal symbols in the grammar.
\item Lifting the definition of each parser into the intermediate \textsc{ast}, as a \scala{Parser} object.
\item Building a map to represent the high-level grammar: the unique symbol of each named parser is mapped to its corresponding \scala{Parser} object and a reference to its original node in the Scala \textsc{ast}.
\end{enumerate}
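These three steps could be sketched with the Scalafix \textsc{api} roughly as follows.
This is an illustration under stated assumptions rather than the actual \texttt{parsley-garnish} implementation: \scala{ParserDefinition}, \scala{liftToParser}, \scala{isParsleyType}, and \scala{collectParsers} are hypothetical names, the lifting step is stubbed out, and only \scala{val} definitions that bind a single name are handled.
\begin{minted}{scala}
import scala.meta._
import scalafix.v1._

// Illustrative sketch only; all names defined in this object are hypothetical.
object GrammarMapSketch {
  // Stand-in for the Parser ADT: the real lifting logic is out of scope here.
  sealed trait Parser
  final case class Unlifted(term: Term) extends Parser

  // Map entry: the lifted parser plus a reference to its original Scala AST node.
  final case class ParserDefinition(name: Term.Name, parser: Parser, originalTree: Term)

  // Step 2 (stubbed): lift a Scalameta term into the intermediate representation.
  def liftToParser(term: Term): Parser = Unlifted(term)

  // True if the symbol's signature has a result type of the form parsley.Parsley[_].
  def isParsleyType(info: SymbolInformation): Boolean = info.signature match {
    case MethodSignature(_, _, TypeRef(_, symbol, _)) => symbol.value == "parsley/Parsley#"
    case _                                            => false
  }

  // Steps 1 and 3: find named parsers and index them by their unique symbol.
  def collectParsers(implicit doc: SemanticDocument): Map[Symbol, ParserDefinition] =
    doc.tree.collect {
      case Defn.Val(_, List(Pat.Var(name)), _, body)
          if name.symbol.info.exists(isParsleyType) =>
        name.symbol -> ParserDefinition(name, liftToParser(body), body)
    }.toMap
}
\end{minted}
Keying the map by symbol rather than by name means that two parsers with the same name in different scopes do not collide.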
\subsubsection{Identifying Named Parsers}
Finding \textsc{ast} nodes corresponding to the definition sites of named parsers involves pattern matching on \scala{val}, \scala{var}, and \scala{def} definitions with a type inferred to be some \scala{Parsley[_]}.
This type information is accessed by querying the Scalafix semantic \textsc{api} for the node's symbol information.
Consider the labelled \textsc{ast} structure of the \scala{example} parser:
\begin{minted}{scala}
// lazy val example: Parsley[String] = (example, string("a")).zipped(_ + _) | string("b")
// ^^^^ ^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// mods pats decltpe rhs
val tree = Defn.Val(
mods = List(Mod.Lazy()),
pats = List(Pat.Var(Term.Name("example"))),
decltpe = Some(
Type.Apply(Type.Name("Parsley"), Type.ArgClause(List(Type.Name("String"))))
),
rhs = Term.ApplyInfix(...)
)
\end{minted}
%
% In this case, the type of \scala{example} is explicitly annotated by the user since this is required for recursive definitions.
% However in general, users will not explicitly annotate the types of their parsers, allowing the Scala compiler to infer the type.
Note that the \scala{decltpe} field refers to the syntax of the explicit type annotation, not the semantic information of the inferred type of the variable.
This field is therefore not always present, so in the general case the type must be queried via a symbol information lookup like so:
\begin{minted}{scala}
tree match {
case Defn.Val(_, List(Pat.Var(varName)), _, body) =>
println(s"qualified symbol = ${varName.symbol}")
varName.symbol.info.get.signature match {
case MethodSignature(_, _, returnType) =>
println(s"type = $returnType")
println(s"structure of type object = ${returnType.structure}")
}
}
// qualified symbol = path/to/package/ObjectName.example.
// type = Parsley[String]
// structure of type object = TypeRef(
// NoType,
// Symbol("parsley/Parsley#"),
// List(TypeRef(NoType, Symbol("scala/Predef.String#"), List()))
// )
\end{minted}
Having identified that the type of this \textsc{ast} node is \scala{Parsley[String]}, \texttt{parsley-garnish} can then proceed to convert the \scala{rhs} term into a \scala{Parser} \textsc{adt} object.
The map entry uses the fully qualified symbol for \scala{example} as the key, and the lifted \scala{Parser} object as the value.
It also includes a reference to the original \scala{rhs} term so that any lint diagnostics or code rewrites can be applied to the correct location in the source file.
% Thus, a full traversal through the source file builds a map of all named parsers, representing all non-terminals in the grammar defined within that file.
\end{document}
