Tidy up simplifying parsers section
roccojiang committed Jun 11, 2024
1 parent 9841054 commit 3ca43e3
Showing 6 changed files with 99 additions and 122 deletions.
Binary file modified src/body/impl.pdf
22 changes: 16 additions & 6 deletions src/body/impl.tex
@@ -3,12 +3,22 @@
\begin{document}

\ourchapter{Simplifying Parsers and Expressions}
Writing domain-specific lint rules unlocks the potential for more powerful and interesting transformations utilising specialised domain knowledge.
Desirable:
* inspectability for analysis (that's what we're here for!) and optimisation
The purpose of this chapter is to describe the intermediate representations of parsers (\cref{sec:parser-representation}) and functions (\cref{sec:function-representation}).
Show that terms must be simplified to a normal form
Demonstrate equivalence to dsl optimisations in staged metaprogramming
The poor-quality output from the previous section motivates the following:
\begin{itemize}
\item \cref{sec:simplify-parsers} discusses how parser terms can be simplified via domain-specific optimisations based on parser laws.
\item \cref{sec:function-representation} discusses how expressions can be partially evaluated. This is achieved using another intermediate \textsc{ast}, this time based on the $\lambda$-calculus.
\end{itemize}

% TODO
% Writing domain-specific lint rules unlocks the potential for more powerful and interesting transformations utilising specialised domain knowledge.
% Desirable:
% * inspectability for analysis (that's what we're here for!) and optimisation
% The purpose of this chapter is to describe the intermediate representations of parsers (\cref{sec:parser-representation}) and functions (\cref{sec:function-representation}).
% Show that terms must be simplified to a normal form
% Demonstrate equivalence to dsl optimisations in staged metaprogramming

% Scalafix runs at the meta-level, outside of the phase distinction of compile- and run-time.
% Staged metaprogramming applies optimisations at compile-time, whereas these ``optimisations'' are applied post-compilation

\subfile{impl/parser}
\subfile{impl/expr}
Binary file modified src/body/impl/parser.pdf
161 changes: 45 additions & 116 deletions src/body/impl/parser.tex
@@ -2,88 +2,23 @@

\begin{document}

\section{Representing and Simplifying Parsers}\label{sec:parser-representation}
\TODO{
This is an INTERMEDIATE SYMBOLIC REPRESENTATION (?)
more specialised than general-purpose scala ast
This section is about simplifying in our semantic domain (parsers)

Scalafix runs at the meta-level, outside of the phase distinction of compile- and run-time.
Staged metaprogramming applies optimisations at compile-time, whereas these ``optimisations'' are applied post-compilation
}

% TODO: come back to this after the section body is finished
% Several of the more complex lint rules, most notably \cref{sec:factor-leftrec}, require manipulating parser combinators in a high-level manner.
\section{Simplifying Parsers}\label{sec:simplify-parsers}

% TODO: our parser representation is akin to Haskell parsley's deep-embedded combinator tree, albeit representing all combinators rather than just the core ones

This \namecref{sec:parser-representation} explores the motivation behind this and the design choices made in the implementation.
Use the left-recursion factoring~(\cref{sec:factor-leftrec}) rule as a basis/context to demonstrate the utility of this representation.
\TODO{This is where the deep embedding approach comes to shine: simplifications are easily expressed by pattern matching on \scala{Parser} constructors.}
% The two only differ in the purpose of the simplification: whereas Haskell \texttt{parsley} does this to produce an optimised \textsc{ast} to be compiled as code, \texttt{parsley-garnish} simplifies the parser \textsc{ast} to be pretty-printed as text.
\begin{itemize}
\item \texttt{parsley} performs rewrites on the parser \textsc{ast} to produce more optimised \emph{code}.
\item \texttt{parsley-garnish} performs rewrites on the parser \textsc{ast} to produce a more readable \emph{textual representation of code}.
\end{itemize}

% TODO: fix the above "intro" ------------------------------------------------------------------------------

\TODO{REMOVE}
Now that raw \textsc{ast} terms can be lifted to the higher-level parser representation, it is easy to build new parsers from existing parsers.
This is crucial for left-recursion factoring, which ``unfolds'' parsers into separate parsers representing the left-recursive and non-left-recursive parts.
These are then recombined to form parsers which are free from left recursion.

Smart constructors are used to make manipulating parser terms resemble writing \texttt{parsley} code itself.
These are defined as infix operators, provided by an implicit value class wrapping the \scala{Parser} trait:
\begin{minted}{scala}
implicit class ParserOps(private val p: Parser) extends AnyVal {
  def <*>(q: Parser): Parser = Ap(p, q)
  def <|>(q: Parser): Parser = Choice(p, q)
  def map(f: Function): Parser = FMap(p, f)
}
\end{minted}
%
Parser terms can now be manipulated in a manner that looks almost indistinguishable from writing \texttt{parsley} code.
For example, the \scala{unfold} method on the \scala{Ap} parser contains this snippet, where \scala{pl}, \scala{ql}, and \scala{q} are parsers (\scala{pe} is not a parser, but rather an \scala{Option} value):
% val lefts = {
% val llr = pl.map(flip) <*> q
% val rlr = pe.map(f => ql.map(composeH(f))).getOrElse(Empty)
% llr <|> rlr
% }
\begin{minted}[escapeinside=\%\%]{scala}
val lefts = {
  val llr = pl.map(%\textcolor{gray}{flip}%) <*> q
  val rlr = pe.map(f => ql.map(%\textcolor{gray}{composeH(f)}%)).getOrElse(Empty)
  llr <|> rlr
}
\end{minted}
Other than the capitalised \scala{Empty} constructor, this would be perfectly valid \texttt{parsley} code.

\subsection{Simplifying Parsers Using Parser Laws}\label{sec:simplify-parsers}
Recombining unfolded parsers during left-recursion factoring introduces many necessary, but extraneous ``glue'' combinators.
Even though the transformed parser is semantically correct, it ends up very noisy syntactically.
Consider the resulting parser from factoring out the left-recursion in \scala{expr}:
% lazy val expr: Parsley[String] = chain.postfix(
% empty | (empty.map(a => b => a + b) | empty <*> expr) <*> string("a")
% | string("b") | empty
% )(
% (empty.map(FLIP) <*> expr | pure(ID).map(COMPOSE(a => b => a + b)))
% .map(FLIP) <*> string("a")
% | empty | empty
% )
\begin{minted}[escapeinside=\%\%]{scala}
lazy val expr: Parsley[String] = chain.postfix(
  empty | (empty.map(%\textcolor{gray}{a => b => a + b}%) | empty <*> expr) <*> string("a")
    | string("b") | empty
)(
  (empty.map(%\textcolor{gray}{flip}%) <*> expr | pure(%\textcolor{gray}{identity}%).map(%\textcolor{gray}{compose(a => b => a + b)}%))
    .map(%\textcolor{gray}{flip}%) <*> string("a")
  | empty | empty
)
\end{minted}
%
The intent of this parser is completely obfuscated -- it would be unacceptable for the output of the transformation to be left in this form.
For human readability, this parser term must be simplified as much as possible, using domain-specific knowledge about parser combinators.
This is where the deep embedding approach comes to shine; simplifications are easily expressed by pattern matching on \scala{Parser} constructors.

\subsection{Parser Laws}
\textcite{willis_staged_2023} note that parser combinators are subject to \emph{parser laws}, which often form a natural simplification in one direction.
In Haskell \texttt{parsley}, \textcite{willis_parsley_2023} uses these parser laws as the basis for high-level optimisations to simplify the structure of the combinator tree.
\texttt{parsley-garnish} uses the same principles to simplify the parser term to become more human-readable.
The two only differ in the purpose of the simplification: whereas Haskell \texttt{parsley} does this to produce an optimised \textsc{ast} to be compiled as code, \texttt{parsley-garnish} simplifies the parser \textsc{ast} to be pretty-printed as text.
Both \texttt{parsley} Scala~\cite{willis_garnishing_2018} and \texttt{parsley} Haskell~\cite{willis_parsley_2023} use these laws as the basis for high-level optimisations to simplify the structure of deeply-embedded parsers.
These same principles can be used by \texttt{parsley-garnish} to simplify parser terms to be more human-readable.

\Cref{fig:parser-laws} shows the subset of parser laws utilised by \texttt{parsley-garnish} for parser simplification.
Most of the laws in \cref{fig:parser-laws} have already been shown to hold for Parsley by \textcite{willis_garnishing_2018}; an additional proof for \cref{eqn:alt-fmap-absorb} can be found in \cref{appendix:parser-law-proofs}.
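As a concrete illustration, the left-neutrality of \scala{empty}~(\cref{eqn:alt-left-neutral}) can be observed with a toy parser type. This sketch is entirely separate from parsley's internals; the names \scala{P}, \scala{str}, and \scala{run} are invented for illustration:

```scala
// Toy backtracking parser: consume a prefix of the input, or fail with None
case class P[A](run: String => Option[(A, String)]) {
  // Alternative: try this parser, fall back to q on failure
  def <|>(q: P[A]): P[A] = P(s => run(s).orElse(q.run(s)))
}

def empty[A]: P[A] = P(_ => None)

def str(t: String): P[String] =
  P(s => if (s.startsWith(t)) Some((t, s.drop(t.length))) else None)

// empty <|> u behaves exactly like u: empty is a left neutral element of <|>
val lhs = (empty[String] <|> str("a")).run("abc")
val rhs = str("a").run("abc")
```

Both \scala{lhs} and \scala{rhs} evaluate to \scala{Some(("a", "bc"))}, matching the law's claim that \scala{empty} is a left neutral element of the choice combinator.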
@@ -109,46 +44,50 @@ \subsection{Simplifying Parsers Using Parser Laws}\label{sec:simplify-parsers}
\label{fig:parser-laws}
\end{figure}

% TODO: vertical spacing here is a bit unsightly, maybe add a \paragraph for these "running example" bits?
In the previous example, it is evident that the most noise results from the \scala{empty} combinators.
\subsubsection{Simplifying the Example Parser}
This section provides a worked example of how the parser in \cref{fig:leftrec-example-bad} is simplified using parser laws.
Most of the noise in \cref{fig:leftrec-example-bad} comes from the large number of \scala{empty} combinators.
These can be eliminated using \cref{eqn:alt-left-neutral,eqn:alt-right-neutral,eqn:alt-empty-absorb,eqn:alt-fmap-absorb}:
% lazy val expr: Parsley[String] = chain.postfix(string("b"))(
% (pure(identity).map(compose(a => b => a + b))).map(flip) <*> string("a")
% (pure(identity).map(compose((_ + _).curried))).map(flip) <*> string("a")
% )
\begin{minted}[escapeinside=\%\%]{scala}
lazy val expr: Parsley[String] = chain.postfix(string("b"))(
  (pure(%\textcolor{gray}{identity}%).map(%\textcolor{gray}{compose(a => b => a + b)}%)).map(%\textcolor{gray}{flip}%) <*> string("a")
  (pure(%\textcolor{gray}{identity}%).map(%\textcolor{gray}{compose((\_ + \_).curried)}%)).map(%\textcolor{gray}{flip}%) <*> string("a")
)
\end{minted}
%
The complicated term in the postfix operator can then be simplified as follows:
% (pure(identity).map(compose(a => b => a + b))).map(flip) <*> string("a")
% pure(compose(a => b => a + b)(identity)).map(flip) <*> string("a")
% pure(flip(compose(a => b => a + b)(identity))) <*> string("a")
% string("a").map(flip(compose(a => b => a + b)(identity)))
This already looks a lot better, but the second parameter to \scala{postfix} can be further simplified as follows:
% (pure(identity).map(compose((_ + _).curried))).map(flip) <*> string("a")
% pure(compose((_ + _).curried)(identity)).map(flip) <*> string("a")
% pure(flip(compose((_ + _).curried)(identity))) <*> string("a")
% string("a").map(flip(compose((_ + _).curried)(identity)))
\begin{minted}[baselinestretch=1.5,escapeinside=\%\%]{scala}
(pure(%\textcolor{gray}{identity}%).map(%\textcolor{gray}{compose(a => b => a + b)}%)).map(%\textcolor{gray}{flip}%) <*> string("a")
(pure(%\textcolor{gray}{identity}%).map(%\textcolor{gray}{compose((\_ + \_).curried)}%)).map(%\textcolor{gray}{flip}%) <*> string("a")
% \proofstep{\cref{eqn:app-homomorphism,eqn:app-fmap}} %
pure(%\textcolor{gray}{compose(a => b => a + b)(identity)}%).map(%\textcolor{gray}{flip}%) <*> string("a")
pure(%\textcolor{gray}{compose((\_ + \_).curried)(identity)}%).map(%\textcolor{gray}{flip}%) <*> string("a")
% \proofstep{\cref{eqn:app-homomorphism,eqn:app-fmap}} %
pure(%\textcolor{gray}{flip(compose(a => b => a + b)(identity))}%) <*> string("a")
pure(%\textcolor{gray}{flip(compose((\_ + \_).curried)(identity))}%) <*> string("a")
% \proofstep{\cref{eqn:app-fmap}} %
string("a").map(%\textcolor{gray}{flip(compose(a => b => a + b)(identity))}%)
string("a").map(%\textcolor{gray}{flip(compose((\_ + \_).curried)(identity))}%)
\end{minted}
%
This results in the most simplified form of the parser:
The most simplified form of the parser is then:
\begin{minted}[escapeinside=\%\%]{scala}
val f: Function = %\textcolor{gray}{flip(compose(a => b => a + b)(identity))}%
val f = %\textcolor{gray}{flip(compose((\_ + \_).curried)(identity))}%
lazy val expr: Parsley[String] = chain.postfix(string("b"))(string("a").map(%\textcolor{gray}{f}%))
\end{minted}
%
The parser has now been expressed in a much simpler form, similar in style to how it would be written by hand.
The remaining challenge is to simplify the contents of the expression \scala{f}, which is tackled in \cref{sec:function-representation}.
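To see concretely what \scala{f} computes, the helper combinators can be evaluated directly. This is a sketch under assumptions: \scala{flip} and \scala{compose} are restated with function-valued results for brevity, rather than the curried method signatures emitted by the transformation:

```scala
// Function-valued variants of the generated helpers (sketch only)
def flip[A, B, C](f: A => B => C): B => A => C = x => y => f(y)(x)
def compose[A, B, C](f: B => C)(g: A => B): A => C = x => f(g(x))

val add: (String, String) => String = _ + _

// The function appearing in the simplified parser:
// flip(compose((_ + _).curried)(identity))
val f = flip(compose(add.curried)(identity[String]))
```

Evaluating \scala{f("a")("b")} yields \scala{"ba"}: \scala{f} is just string concatenation with its arguments flipped, which is why it ought to reduce to a simple lambda.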

\paragraph{Encapsulating boilerplate}
Lawful simplifications are applied akin to peephole optimisations on the recursively defined \scala{Parser} \textsc{adt}.
There are many instances of parsers, which inevitably leads to repetitive and error-prone boilerplate code which exists to simply recurse through each case.
To avoid this, the recursive traversal itself is decoupled from the application of the transformation function.
Although the traversal is still hand-written, the implementation is inspired by the generic traversal patterns offered by Haskell's \texttt{uniplate} library~\cite{mitchell_uniform_2007}.
\subsection{Implementing Rewrites on the Parser \textsc{ast}}
Lawful simplifications are applied by a bottom-up transformation over the recursively defined \scala{Parser} \textsc{ast}.
Since there are many parser cases, this inevitably leads to repetitive and error-prone boilerplate code which simply exists to recursively propagate the transformation through each case.
To avoid this, the recursive traversal itself can be decoupled from the definition of the transformation function.
Although the traversal is still hand-written, this implementation is inspired by the generic traversal patterns offered by Haskell's \texttt{uniplate} library~\cite{mitchell_uniform_2007}.

This is realised as a \scala{transform} method on the \scala{Parser} trait, which takes a partial function and applies it to nodes where it is defined.
The traversal is realised as a \scala{transform} method on the \scala{Parser} trait, which takes a partial function and applies it to nodes where it is defined.
The transformation is applied via a bottom-up traversal:
\begin{minted}{scala}
def transform(pf: PartialFunction[Parser, Parser]): Parser = {
@@ -177,31 +116,21 @@ \subsection{Simplifying Parsers Using Parser Laws}\label{sec:simplify-parsers}
def simplify: Parser = this.rewrite {
  // p.map(f).map(g) == p.map(g compose f)
  case FMap(FMap(p, f), g) => FMap(p, composeH(g, f))
  // pure(f) <*> pure(x) == pure(f(x))
  case Pure(f) <*> Pure(x) => Pure(app(f, x))
  // u <|> empty == u
  case Choice(u, Empty) => u
  case u <|> Empty => u
  // pure(f) <|> u == pure(f)
  case Choice(Pure(f), _) => Pure(f)
  case Pure(f) <|> _ => Pure(f)
  ...
}
\end{minted}
%
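The \scala{rewrite} method used by \scala{simplify} is not shown in this excerpt. A minimal sketch, over a cut-down hypothetical \scala{Parser} \textsc{adt}, illustrates one way it can be defined: apply \scala{transform} repeatedly until the term reaches a fixed point, so that simplifications which enable further simplifications are fully applied:

```scala
// Sketch only: a cut-down Parser ADT standing in for the real one
sealed trait Parser {
  // Bottom-up traversal: recurse into children first, then apply pf where defined
  def transform(pf: PartialFunction[Parser, Parser]): Parser = {
    val mapped = this match {
      case Ap(p, q)     => Ap(p.transform(pf), q.transform(pf))
      case Choice(p, q) => Choice(p.transform(pf), q.transform(pf))
      case _            => this // leaves: Pure, Empty, Str
    }
    pf.applyOrElse(mapped, identity[Parser])
  }

  // Keep transforming until the term stops changing
  def rewrite(pf: PartialFunction[Parser, Parser]): Parser = {
    val rewritten = transform(pf)
    if (rewritten == this) this else rewritten.rewrite(pf)
  }
}
case class Ap(p: Parser, q: Parser) extends Parser
case class Choice(p: Parser, q: Parser) extends Parser
case class Pure(x: String) extends Parser
case class Str(s: String) extends Parser
case object Empty extends Parser
```

For instance, rewriting \scala{Choice(Choice(Str("a"), Empty), Empty)} with the single rule \scala{case Choice(u, Empty) => u} collapses the whole term to \scala{Str("a")}.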
Further design considerations are made to ensure the extensibility and safety of this approach: the \scala{Parser} trait is sealed, which enables compiler warnings if a new \scala{Parser} case is added and the \scala{transform} method is not updated.
Since the traversal is still written by hand rather than generically derived, it is still more prone to error
The traversal could be generically derived rather than written by hand, but this would require the use of an external dependency such as \texttt{shapeless}\footnote{\url{https://github.com/milessabin/shapeless}},
which is overkill for the complexity of the \scala{Parser} \textsc{adt}.

\subsection{Converting Parsers Back to Scalameta Terms}
After parsers have been transformed and simplified, the last step is to convert them back to a textual representation to be applied as a Scalafix patch.
Parsers can be lowered back to \scala{scala.meta.Term} nodes by the inverse of the original \scala{fromTerm} transformation.
The \scala{Parser} trait defines this transformation as the method \scala{term}, using quasiquotes to simplify the construction of the \scala{scala.meta.Term} nodes.
\begin{minted}{scala}
case class Zipped(func: Function, parsers: List[Parser]) extends Parser {
  val term: Term = q"(..${parsers.map(_.term)}).zipped(${func.term})"
}
\end{minted}
%
This term can then be pretty-printed into a string, and applied as a Scalafix patch.

\subsection*{Summary}
\paragraph{Extensibility and Safety}
Further design considerations are made to ensure the extensibility of this approach: the \scala{Parser} trait is sealed, which enables compiler warnings if a new \scala{Parser} case is added and the \scala{transform} method is not updated.
Although this formulation of the traversal is inspired by generic traversals, it still manually defines the traversal for each case: a safer approach would be to generically derive this.
% parsley Haskell achieves this with cata
In Scala, this would require the use of an external dependency such as \texttt{shapeless}\footnote{\url{https://github.com/milessabin/shapeless}},
which is overkill given the relative simplicity of the \scala{Parser} \textsc{adt}.
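The safety benefit of sealing can be sketched in miniature, using hypothetical names: because all subtypes of a sealed trait must be declared in the same file, the compiler can warn when a pattern match misses a case.

```scala
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Square(side: Double) extends Shape

// Exhaustive match: removing either case below would produce a
// non-exhaustive match warning at compile time
def area(s: Shape): Double = s match {
  case Circle(r)    => math.Pi * r * r
  case Square(side) => side * side
}
```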

\end{document}
Binary file modified src/body/leftrec.pdf
38 changes: 38 additions & 0 deletions src/body/leftrec.tex
@@ -226,8 +226,46 @@ \subsubsection{Building the Grammar Map}
}.toMap
\end{minted}
\subsection{Lowering back to the Scalameta \textsc{ast}}
After all necessary transformations have been applied to parser terms, the final step is to convert them back to a textual representation to be applied as a Scalafix patch.
Parsers can be lowered back to \scala{scala.meta.Term} nodes by the inverse of the original \scala{fromTerm} transformation.
The \scala{Parser} trait defines this transformation as the method \scala{term}, using quasiquotes to simplify the construction of the \scala{scala.meta.Term} nodes.
\begin{minted}{scala}
case class Zipped(func: Function, parsers: List[Parser]) extends Parser {
  val term: Term = q"(..${parsers.map(_.term)}).zipped(${func.term})"
}
\end{minted}
%
This term can then be pretty-printed into a string, and applied as a Scalafix patch.
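The shape of this lowering can be sketched with a cut-down \scala{Parser} \textsc{adt} that pretty-prints directly to strings; this is an illustrative assumption standing in for the real implementation, which builds \scala{scala.meta.Term} nodes via quasiquotes:

```scala
sealed trait Parser {
  // Inverse of fromTerm, sketched as direct pretty-printing rather than
  // scala.meta.Term construction
  def term: String = this match {
    case Str(s)       => s"""string("$s")"""
    case Pure(x)      => s"pure($x)"
    case Empty        => "empty"
    case Ap(p, q)     => s"${p.term} <*> ${q.term}"
    case Choice(p, q) => s"${p.term} | ${q.term}"
  }
}
case class Ap(p: Parser, q: Parser) extends Parser
case class Choice(p: Parser, q: Parser) extends Parser
case class Pure(x: String) extends Parser
case class Str(s: String) extends Parser
case object Empty extends Parser
```

For example, \scala{Choice(Str("a"), Str("b")).term} renders as the text \scala{string("a") | string("b")}.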
\subsection{Implementing the Left-Recursion Transformation}
\TODO{TODO}
Running the transformation on the \scala{example} parser thus yields:
\begin{figure}
\begin{minted}{scala}
def flip[A, B, C](f: A => B => C)(x: B)(y: A): C = f(y)(x)
def compose[A, B, C](f: B => C)(g: A => B)(x: A): C = f(g(x))
lazy val example: Parsley[String] = chain.postfix(
  empty | (empty.map((_ + _).curried) | empty <*> example) <*> string("a")
    | string("b") | empty
)(
  (empty.map(flip) <*> example | pure(identity).map(compose((_ + _).curried)))
    .map(flip) <*> string("a")
  | empty | empty
)
\end{minted}
\caption{The initial attempt at factoring out left-recursion from the \scala{example} parser.}
\label{fig:leftrec-example-bad}
\end{figure}
%
Oh, dear.
There are \emph{many} things wrong with the transformed output:
\begin{itemize}
\item This output is horrendously complex and unreadable. The intent of the parser is entirely obfuscated in a sea of combinators.
\item Having to define the \scala{flip} and \scala{compose} functions is not ideal, but inlining them as lambdas would make the code even worse.
\item The parser does not even typecheck -- unlike classical Hindley-Milner-based type systems, Scala only supports local type inference~\cite{cremet_core_2006}. As a result, the compiler is unable to infer the correct types for \scala{flip}, and also asks for explicit type annotations in the lambda \scala{(_ + _).curried}.
\end{itemize}
\end{document}
