Finish the bulk of pre-parser simplification
roccojiang committed Jun 10, 2024
1 parent d025700 commit 1427a4d
Showing 4 changed files with 92 additions and 66 deletions.
Binary file modified src/body/impl/parser.pdf
Binary file not shown.
57 changes: 0 additions & 57 deletions src/body/impl/parser.tex
@@ -22,63 +22,6 @@ \section{Representing and Simplifying Parsers}\label{sec:parser-representation}

% TODO: fix the above "intro" ------------------------------------------------------------------------------


\subsection{Converting Scalameta Terms to the Parser \textsc{adt}}
Having identified the \textsc{ast} nodes which represent parsers, they need to be transformed into the appropriate \scala{Parser} representation.
This involves pattern matching on the \scala{scala.meta.Term} to determine which parser combinator it represents, and then constructing the appropriate \scala{Parser} instance.

Each \scala{Parser} defines a partial function, \scala{fromTerm}, which creates an instance of that parser from the appropriate \scala{scala.meta.Term}.
These \scala{fromTerm} methods are combined to define a \scala{toParser} extension method on \scala{scala.meta.Term} -- this is where \textsc{ast} nodes are lifted to their corresponding \scala{Parser} representation.
% Use Scalafix's \scala{SymbolMatcher} to match tree nodes that resolve to a specific set of symbols.
% This makes use of semantic information from SemanticDB, so we are sure that a \scala{<*>} is actually within the \scala{parsley.Parsley} package, rather than some other function with the same name.
% This is much more robust compared to HLint, which suffers from false positives due to its reliance on syntactic information only.

The top-level combinator that makes up \scala{expr}'s definition is the choice combinator, \scala{|}.
Scalameta represents this infix application of the \scala{|} operator as follows:
\begin{minted}{scala}
Term.ApplyInfix(
  lhs = Term.Apply(...), // AST node for (expr, string("a")).zipped(_ + _)
  op = Term.Name("|"),
  targClause = Type.ArgClause(List()),
  argClause = Term.ArgClause(
    List(
      Term.Apply(
        Term.Name("string"),
        Term.ArgClause(List(Lit.String("b")), None)
      )
    ),
    None
  )
)
\end{minted}
%
This structure therefore guides the implementation of the pattern match in \scala{Choice.fromTerm}:
\begin{minted}{scala}
object Choice {
  val matcher = SymbolMatcher.normalized("parsley.Parsley.`|`", "parsley.Parsley.`<|>`")

  def fromTerm(implicit doc: SemanticDocument): PartialFunction[Term, Choice] = {
    case Term.ApplyInfix(p, matcher(_), _, Term.ArgClause(List(q), _)) =>
      Choice(p.toParser, q.toParser)
  }
}
\end{minted}
%
The definition of this method is fairly self-explanatory: it matches on an \scala{ApplyInfix} term where the operator is the \scala{|} combinator, and recursively applies \scala{toParser} to its \textsc{lhs} and \textsc{rhs} nodes.
Finishing off, the \scala{expr} parser is therefore converted to the following \scala{Parser} instance:
% Choice(
% Zipped(Function(_ + _), List(NonTerminal(expr), Str(a))),
% Str(b)
% )
\begin{minted}[escapeinside=\%\%]{scala}
Choice(
Zipped(%\textcolor{gray}{Function(\_ + \_)}%, List(NonTerminal(expr), Str(a))),
Str(b)
)
\end{minted}
The exact representation of the \scala{Function} is not important at this moment -- this is covered in the next \namecref{sec:function-representation}.
For brevity, the remaining code snippets in this \namecref{sec:parser-representation} will simplify the function representations and continue to grey them out.

\subsection{Building New Parsers From Existing Parsers}
Now that raw \textsc{ast} terms can be lifted to the higher-level parser representation, it is easy to build new parsers from existing parsers.
This is crucial for left-recursion factoring, which ``unfolds'' parsers into separate parsers representing the left-recursive and non-left-recursive parts.
Binary file modified src/body/leftrec.pdf
Binary file not shown.
101 changes: 92 additions & 9 deletions src/body/leftrec.tex
@@ -13,7 +13,7 @@ \section{Implementation}
lazy val example: Parsley[String] = (example, string("a")).zipped(_ + _) | string("b")
\end{minted}

\subsection{The Need for an Intermediate \textsc{ast}}
\subsection{The Need for an Intermediate \textsc{ast}}\label{sec:parser-ast-motivation}
The transformations described by \textcite{baars_leftrec_2004} require an explicit representation of the grammar and production rules so that they can be inspected and manipulated before generating code.
They achieve this by representing parsers as a deep-embedded datatype in the form of an intermediate \textsc{ast}, in a similar manner to \texttt{parsley}.

@@ -37,7 +37,7 @@ \subsection{The Need for an Intermediate \textsc{ast}}
\begin{minted}{scala}
val ap = SymbolMatcher.normalized("parsley.Parsley.`<*>`")
def deconstruct(parser: Term)(implicit doc: SemanticDocument) = parser match {
def deconstruct(parser: Term) = parser match {
case Term.ApplyInfix(p, ap(_), _, Term.ArgClause(List(q), _)) => (p, q)
}
\end{minted}
@@ -56,7 +56,7 @@ \subsubsection{The Parser \textsc{adt}}
case class NonTerminal(ref: Symbol) extends Parser
case class Pure(f: Function) extends Parser
case class Pure(x: Term) extends Parser
case object Empty extends Parser
case class Ap(p: Parser, q: Parser) extends Parser
@@ -81,20 +81,41 @@ \subsubsection{The Parser \textsc{adt}}
case p <|> q => (p, q) // using extractor object
}
\end{minted}
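The infix pattern \scala{p <|> q} relies on an extractor object whose name is the operator itself.
The following is a minimal, self-contained sketch of how such an extractor could be written; the toy \scala{Parser} hierarchy and the wrapper object \scala{ExtractorSketch} are assumptions made for illustration, not the definitions used by \texttt{parsley-garnish}.
\begin{minted}{scala}
object ExtractorSketch {
  // Toy stand-ins for the Parser ADT (hypothetical, for illustration only)
  sealed trait Parser
  case class Str(s: String) extends Parser
  case class Choice(p: Parser, q: Parser) extends Parser

  // An extractor named after the operator allows patterns to be written infix
  object <|> {
    def unapply(parser: Parser): Option[(Parser, Parser)] = parser match {
      case Choice(p, q) => Some((p, q))
      case _            => None
    }
  }

  // `p <|> q` in pattern position is shorthand for `<|>(p, q)`
  def deconstruct(parser: Parser): Option[(Parser, Parser)] = parser match {
    case p <|> q => Some((p, q))
    case _       => None
  }
}
\end{minted}
Because the extractor returns an \scala{Option}, any parser that is not a choice simply fails to match rather than raising an error.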
%
As an example, the \scala{example} parser is then represented as a \scala{Parser} object resembling the following (where quasiquote notation is used to keep the lambda expression term \scala{q"_ + _"} concise):
\begin{minted}{scala}
// (example, string("a")).zipped(_ + _) | string("b")
Choice(
  Zipped(
    q"_ + _",
    List(
      NonTerminal(Sym("path/to/package/ObjectName.example.")),
      Str("a")
    )
  ),
  Str("b")
)
\end{minted}
% Instead, represent parsers as an algebraic data type \textsc{adt} in the same way that Parsley itself uses a deep embedding to represent combinators as objects.
% Methods on these objects can then be used to manipulate them, and the resulting object can still be pattern matched, maintaining the static inspectability of the parsers.
% So then it's just like writing parsers in Parsley itself: \scala{p <*> q} constructs a \scala{Ap(p, q)} node which can still be pattern matched on.
% And similar to Parsley, representing everything as objects makes it easy to optimise using pattern matching on constructors.
% This representation also then gives us for free the implementation for lint rules such as \emph{Simplify Complex Parsers} rule, which applies parser laws to simplify parsers.
\subsection{Lifting Scalameta Terms to the Intermediate Parser \textsc{ast}}
\subsection{Lifting to the Intermediate Parser \textsc{ast}}
Converting the raw Scala \textsc{ast} to the intermediate \textsc{ast} therefore requires the following basic operations:
\begin{enumerate}
\item Identifying all named parsers defined in the source program -- these correspond to non-terminal symbols in the grammar.
\item Lifting the definition of each parser into the intermediate \textsc{ast}, as a \scala{Parser} object.
\item Building a map to represent the high-level grammar: the unique symbol of each named parser is mapped to its corresponding \scala{Parser} object and a reference to its original node in the Scala \textsc{ast}.
\item Collecting these into a map to represent the high-level grammar: the unique symbol of each named parser is mapped to its corresponding \scala{Parser} object, along with some extra meta-information required for the transformation.
\end{enumerate}
%
Most importantly, this meta-information includes a reference to a parser's original node in the Scala \textsc{ast}, so that any lint diagnostics or code rewrites can be applied to the correct location in the source file.
This is simply defined as:
\begin{minted}{scala}
case class ParserDefn(name: Term.Name, parser: Parser, tpe: Type.Name, originalTree: Term)
\end{minted}
\subsubsection{Identifying Named Parsers}
Finding \textsc{ast} nodes corresponding to the definition sites of named parsers involves pattern matching on \scala{val}, \scala{var}, and \scala{def} definitions with a type inferred to be some \scala{Parsley[_]}.
@@ -105,7 +126,7 @@ \subsubsection{Identifying Named Parsers}
// ^^^^ ^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// mods pats decltpe rhs
val tree = Defn.Val(
val exampleTree = Defn.Val(
mods = List(Mod.Lazy()),
pats = List(Pat.Var(Term.Name("example"))),
decltpe = Some(
@@ -128,7 +149,7 @@ \subsubsection{Identifying Named Parsers}
println(s"type = $returnType")
println(s"structure of type object = ${returnType.structure}")
}
}
}
// qualified symbol = path/to/package/ObjectName.example.
// type = Parsley[String]
// structure of type object = TypeRef(
@@ -137,10 +158,72 @@ \subsubsection{Identifying Named Parsers}
// List(TypeRef(NoType, Symbol("scala/Predef.String#"), List()))
// )
\end{minted}
Having identified that the type of this \textsc{ast} node is \scala{Parsley[String]}, \texttt{parsley-garnish} can then proceed to convert the \scala{rhs} term into a \scala{Parser} \textsc{adt} object.
Seeing that the type of this \textsc{ast} node is \scala{Parsley[String]}, \texttt{parsley-garnish} can then proceed to convert the \scala{rhs} term into a \scala{Parser} \textsc{adt} object.
The map entry uses the fully qualified symbol for \scala{example} as the key, and the lifted \scala{Parser} object as the value.
It also includes a reference to the original \scala{rhs} term so that any lint diagnostics or code rewrites can be applied to the correct location in the source file.
% Thus, a full traversal through the source file builds a map of all named parsers, representing all non-terminals in the grammar defined within that file.
\subsubsection{Converting Scalameta Terms to the Parser \textsc{adt}}
Having identified the \textsc{ast} nodes which represent parsers, they need to be transformed into the appropriate \scala{Parser} representation.
This involves pattern matching on the \scala{scala.meta.Term} to determine which parser combinator it represents, and then constructing the appropriate \scala{Parser} instance.
Each \scala{Parser} defines a partial function \scala{fromTerm} to instantiate a parser from the appropriate \scala{scala.meta.Term}.
These \scala{fromTerm} methods perform the ugly work of pattern matching on the low-level syntactic constructs of the Scala \textsc{ast}.
All \scala{fromTerm} methods are combined to define the \scala{toParser} extension method on \scala{scala.meta.Term} -- this is where \textsc{ast} nodes are lifted to their corresponding \scala{Parser} representation.
The pattern matching example from \cref{sec:parser-ast-motivation} makes a reappearance in the definition of \scala{Ap.fromTerm}, where the arguments to the \scala{<*>} combinator are recursively lifted to \scala{Parser} objects:
% Use Scalafix's \scala{SymbolMatcher} to match tree nodes that resolve to a specific set of symbols.
% This makes use of semantic information from SemanticDB, so we are sure that a \scala{<*>} is actually within the \scala{parsley.Parsley} package, rather than some other function with the same name.
% This is much more robust compared to HLint, which suffers from false positives due to its reliance on syntactic information only.
\begin{minted}{scala}
// Type signatures in Parsley:
// p: Parsley[A => B], q: =>Parsley[A], p <*> q: Parsley[B]
case class Ap(p: Parser, q: Parser) extends Parser
object Ap {
  val matcher = SymbolMatcher.normalized("parsley.Parsley.`<*>`")
  def fromTerm: PartialFunction[Term, Ap] = {
    case Term.ApplyInfix(p, matcher(_), _, Term.ArgClause(List(q), _)) =>
      Ap(p.toParser, q.toParser)
  }
}
\end{minted}
%
Where a combinator takes a non-parser argument, this is treated as a black box and kept as a raw \textsc{ast} node:
\begin{minted}{scala}
// x: A, pure(x): Parsley[A]
case class Pure(x: Term) extends Parser
object Pure {
  val matcher = SymbolMatcher.normalized("parsley.ParsleyImpl.pure")
  def fromTerm: PartialFunction[Term, Pure] = {
    case Term.Apply(matcher(_), Term.ArgClause(List(expr), _)) => Pure(expr)
  }
}
\end{minted}
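One plausible way to combine the individual \scala{fromTerm} partial functions into a single \scala{toParser} is to chain them with \scala{orElse} and fall back to a catch-all node for unrecognised terms.
The sketch below is self-contained and makes several assumptions: the simplified \scala{Term} type stands in for \scala{scala.meta.Term}, operators are matched by name rather than by symbol, and the \scala{Unknown} fallback and the wrapper object \scala{LiftSketch} are hypothetical names.
\begin{minted}{scala}
object LiftSketch {
  // Simplified stand-in for scala.meta.Term (assumption for this sketch)
  sealed trait Term
  case class Name(value: String) extends Term
  case class Infix(lhs: Term, op: String, rhs: Term) extends Term
  case class Apply(fun: String, arg: Term) extends Term

  sealed trait Parser
  case class Ap(p: Parser, q: Parser) extends Parser
  case class Pure(x: Term) extends Parser
  case class Unknown(term: Term) extends Parser // hypothetical fallback node

  object Ap {
    // Parser arguments are lifted recursively
    val fromTerm: PartialFunction[Term, Parser] = {
      case Infix(p, "<*>", q) => Ap(toParser(p), toParser(q))
    }
  }

  object Pure {
    // Non-parser arguments are kept as raw terms
    val fromTerm: PartialFunction[Term, Parser] = {
      case Apply("pure", x) => Pure(x)
    }
  }

  // Try each fromTerm in turn; the first one defined for the term wins
  private val lift: PartialFunction[Term, Parser] =
    Ap.fromTerm.orElse(Pure.fromTerm)

  def toParser(term: Term): Parser =
    lift.applyOrElse(term, (t: Term) => Unknown(t))
}
\end{minted}
Keeping each combinator's matching logic in its own companion object means that supporting a new combinator only requires adding one more link to the \scala{orElse} chain.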
\subsubsection{Building the Grammar Map}
The overall process of converting the source file \textsc{ast} to a high-level map of the grammar can therefore be expressed as a single traversal over the \textsc{ast}:
\begin{minted}{scala}
object VariableDecl {
  // Extractors return an Option so that non-matching trees are skipped by collect
  def unapply(tree: Tree): Option[ParserDefn] = tree match {
    case Defn.Val(_, List(Pat.Var(varName)), _, body) if isParsleyType(varName) =>
      Some(ParserDefn(
        name = varName,
        parser = body.toParser,
        tpe = getParsleyType(varName),
        originalTree = body
      ))
    // similar cases for Defn.Var and Defn.Def
    case _ => None
  }
}

val nonTerminals: Map[Symbol, ParserDefn] = doc.tree.collect {
  case VariableDecl(parserDefn) => parserDefn.name.symbol -> parserDefn
}.toMap
\end{minted}
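The extractor-plus-\scala{collect} pattern can also be tried out in isolation.
The following self-contained sketch uses a toy \scala{Tree} type, string-based type checks, and a flat list in place of a real \textsc{ast} traversal; all of these, along with the wrapper object \scala{GrammarMapSketch}, are simplifying assumptions rather than the actual implementation.
\begin{minted}{scala}
object GrammarMapSketch {
  // Toy stand-ins for Scalameta trees (assumption for this sketch)
  sealed trait Tree
  case class ValDefn(name: String, tpe: String, body: String) extends Tree
  case class Other(text: String) extends Tree

  // Simplified definition record: strings replace Term.Name, Type.Name and Term
  case class ParserDefn(name: String, tpe: String, body: String)

  object VariableDecl {
    // Succeeds only for definitions whose declared type looks like Parsley[_]
    def unapply(tree: Tree): Option[ParserDefn] = tree match {
      case ValDefn(name, tpe, body) if tpe.startsWith("Parsley[") =>
        Some(ParserDefn(name, tpe, body))
      case _ => None
    }
  }

  // collect keeps only the trees for which the extractor succeeds
  def nonTerminals(trees: List[Tree]): Map[String, ParserDefn] =
    trees.collect { case VariableDecl(defn) => defn.name -> defn }.toMap
}
\end{minted}
Definitions that are not parsers, and trees that are not definitions at all, never reach the map because the extractor returns \scala{None} for them.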
\subsection{Implementing the Left-Recursion Transformation}
\TODO{TODO}
\end{document}
