-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathparsers.tex
324 lines (260 loc) · 11.1 KB
/
parsers.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
\chapter{Parsers}\label{parsers}
I was asked on IRC how to use parser combinators to write a simple
translator from s-expressions to Factor quotations. The translation
was quite simple - here are some examples of how it was supposed to
work:
\begin{verbatim}
(set a 10)
=> [ "set" "a" 10 ]
(set square (lambda (n) (* n n)))
=> [ "set" "square" [ "lambda" [ "n" ] [ "*" "n" "n" ] ] ]
\end{verbatim}
This looked like it should be fairly simple so I put together a quick
implementation.
The parser combinators were recently rewritten so this code is likely
to work best with the Darcs version of Factor. Instructions on how to
get this going are at the Factor website. Basic instructions for an
Intel Linux based machine would go something like:
\begin{verbatim}
git clone http://www.factorcode.org/
cd factor
make linux-x86-32
wget http://factorcode.org/images/latest/boot.x86.32.image
./factor -i=boot.x86.32.image
./factor
\end{verbatim}
To run the examples from within Factor you will need to load the
parser-combinators module and use it:
\begin{verbatim}
USE: parser-combinators
USE: lazy-lists
\end{verbatim}
A parser combinator is a tuple that has specialised the `parse'
generic word. This word has stack effect ( input parser -- result
). The input is a sequence of tokens (usually a string) and the
result is what is known as a lazy list of successes.
A `list of sucesses' is a list of all possible sucessful parse
results. It being lazy, each result in the list is not computed
(ie. the parse is not done) until that list element is
requested. This is how parser combinators handle backtracking. If the
first parse result in the list is not what is required, the second
can be tried, etc.
Parsers can be written manually or existing parsers can be combined
together using combinators. A number of existing parsers and
combinators are provided in the `parser-combinators' vocabulary to
build simple parsers.
To start with the s-expression parser we can handle the `(' and `)'
using the `token' word. This word returns a parser that parses only
the given string:
\begin{verbatim}
LAZY: 'lparens' ( -- parser )
"(" token ;
LAZY: 'rparens' ( -- parser )
")" token ;
\end{verbatim}
Notice these are defined with the `LAZY:' word instead of `:'. This
means the result of the word is a promise and needs to be forced to
get the actual value. Although not strictly necessary with these
simple parsers it is required for those that recursively call
themselves so I generally always define parsers using it. Examples of
usage:
\begin{verbatim}
"(" 'lparens' parse car . => T{ parse-result f "(" T{ slice f "(" 1 1 } }
\end{verbatim}
The result of the parse is a lazy list. Returning the `car' of that
list shows the first parse result. The `parse-result' tuple has two
slots. The first slot is the parse tree returned by the parser. The
second is the rest of the input string remaining to be parsed if
any. Usually the latter is a slice for efficiency purposes. The
parsers created with `token' return the token itself as the parse
tree.
The s-expression parser needs to parse numbers and
identifiers. Numbers are digits repeated multiple times. A parser for
digits could be:
\begin{verbatim}
LAZY: 'digit' ( -- parser )
[ digit? ] satisfy ;
\end{verbatim}
This uses the `satisfy' parser generator word. Where `token' parses an
input string for a specified string, `satisfy' will call a given
quotation with the first character of the input string. If the
quotation returns `true' then the parse is succesful otherwise it
fails. Examples of using this definition of `digit' are:
\begin{verbatim}
"123" 'digit' parse car parse-result-parsed . => 49
\end{verbatim}
The `parse-result-parsed' word returns just the parse tree of the
result from the `parse-result' tuple. The result, `49', is the
character code of the successfully parsed digit. Ideally I want the
actual number returned rather than the character code.
The `$<$@' word applies a transformation to the parse tree of a
parser. It converts an existing parser to one that does what the
original parser did but has the transformation applied. `$<$@' takes a
quotation on the stack with stack effect ( old-tree -- new-tree
). This quotation is called to perform the transformation. One way to
remember the effect of `$<$@' is to note that `$<$' sign points to the
parser being transformed. Here is `digit' converted to return a
number:
\begin{verbatim}
LAZY: 'digit' ( -- parser )
[ digit? ] satisfy [ digit> ] <@ ;
"123" 'digit' parse car parse-result-parsed . => 1
\end{verbatim}
`digit$>$' is a standard Factor word that coverts the character code of
a digit into the digit itself. A number is one or more digits in a
row. The $<$+$>$ word takes a parser as input and returns a parser that
parses one or more instances of the original. It has the same meaning
as `+' in regular expressions. This allows a `number' parser word to
be written as:
\begin{verbatim}
LAZY: 'number' ( -- parser )
'digit' <+> ;
"123" 'number' parse car parse-result-parsed . => { 1 2 3 }
\end{verbatim}
Notice here that the result is an array of the digits in the
number. This is the parse tree that $<$+$>$ generates - an array of the
results of the original parser. By using `$<$@' this can be converted
into a numeric value using Factor's `reduce' word (basically a left
fold):
\begin{verbatim}
LAZY: 'number' ( -- parser )
'digit' <+> [ 0 [ swap 10 * + ] reduce ] <@ ;
"123" 'number' parse car parse-result-parsed . => 123
\end{verbatim}
Interestingly `number' actually returns more than one parse result
since "123" contains more than one occurrence of digits in a sequence:
\begin{verbatim}
"123" 'number' parse [ parse-result-parsed ] lmap [ . ] leach
=> 123
12
1
\end{verbatim}
It returns the longest match first which is what we usually want. As
the list is lazy if we don't request anything beyond the first match
then the remaining parse results aren't actually computed.
Next on the list is identifiers. These are any sequences of characters
that are not numbers, spaces, or parenthesis. The `satisfy' word can
be used for this, combined with $<$+$>$:
\begin{verbatim}
LAZY: 'identifier' ( -- parser )
[
[ blank? not ] keep
[ digit? not ] keep
[ CHAR: ( = not ] keep
CHAR: ) = not
and and and
] satisfy <+> [ >string ] <@ ;
"foo" 'identifier' parse car parse-result-parsed . => "foo"
\end{verbatim}
The `$<$@' word is used to transform the result into a string otherwise
we'd get an array of the character codes in the identifier as a
result.
Often it is desired to parse either a number or an identifier. The
`atom' word does this:
\begin{verbatim}
LAZY: 'atom' ( -- parser )
'identifier' 'number' <|> ;
"123" 'atom' parse car parse-result-parsed . => 123
"foo" 'atom' parse car parse-result-parsed . => "foo"
\end{verbatim}
It uses `$<|>$', known as the choice word. It takes two parsers on the
stack, and generates a parser which when run will try the first parser
on the input string. If it fails it will then try the second
parser. It returns the list of successes resulting from both parses.
A first cut at parsing a single expression like `(set a 10)' requires
a sequencing parser combinator. This is `$<$\&$>$'. It takes two parsers
and returns a resulting parser which when run will run the first
parser on the input string, and then run the second parser on the
remaining unparsed portion of the input string of each result from the
first parser. A simple expression parser might look like:
\begin{verbatim}
LAZY: 'expr1' ( -- parser )
'lparens'
'atom' <&>
'rparens' <&> ;
"(123)" 'expr1' parse car parse-result-parsed . =>
{ { "(" 123 } ")" }
\end{verbatim}
This results in a very ugly parse tree. The `$<$\&$>$' word returns a parse
tree which is an array of the two parsers it uses. Since we have two
`$<$\&$>$' calls we get nested results. This can be fixed using two
variants of `$<$\&$>$'. They are `$<$\&' and `\&$>$'. These words do the same as
`$<$\&$>$' but don't put the result of one of the parsers in the parse
tree. The `$>$' or `$<$' point to the parser that will have the result
returned. Here's a new version that has a nicer parse tree using these
words:
\begin{verbatim}
LAZY: 'expr2' ( -- parser )
'lparens'
'atom' &>
'rparens' <& ;
"(123)" 'expr2' parse car parse-result-parsed . => 123
\end{verbatim}
Notice the variants of `$<$\&$>$' used both select the result of the `atom'
parser to be included in the parse tree and not the results of the
parenthesis parsers.
There is still a problem with the expression parser. It doesn't handle
more than one `atom' in the expression. Similar to `$<$+$>$' there is
`$<$*$>$' which takes a parser and returns a new parser which when called
parses zero or more instances of the original parser. The parse tree
for the result of `$<$*$>$' is an array of the results of the original
parser:
\begin{verbatim}
LAZY: 'expr3' ( -- parser )
'lparens'
'atom' sp <*> &>
'rparens' <& ;
"(set a 123)" 'expr3' parse car parse-result-parsed . =>
{ "set" "a" 123 }
\end{verbatim}
This has one other change in that it needs to handle white space
between atoms. The `sp' word takes a parser (in this case `atom') and
returns a parser that first removes all white space from the start of
the input string and then calls the original parser. The effect is to
produce a parser that ignores white space.
This gets close to what out test case requires but it returns an array
not a quotation. Using `$<$@' fixes this:
\begin{verbatim}
LAZY: 'expr4' ( -- parser )
'lparens'
'atom' sp <*> &>
'rparens' <& [ >quotation ] <@ ;
"(set a 123)" 'expr4' parse car parse-result-parsed . =>
[ "set" "a" 123 ]
\end{verbatim}
Unfortunately this fails on our second test case which requires
handling nested expressions like `(+ 1 (+ 2 3) 4)'. By recursively
calling the `expr' word this is easy to add:
\begin{verbatim}
LAZY: 'expr5' ( -- parser )
'lparens'
'atom' 'expr5' <|> sp <*> &>
'rparens' <& [ >quotation ] <@ ;
"(+ 10 (+ 20 30))" 'expr5' parse car parse-result-parsed . =>
[ "+" 10 [ "+" 20 30 ] ]
\end{verbatim}
This change was as simple as replacing the `atom' parser used between
the two parenthesis with a parser which checks for an atom or an
expression. Note that this recursively calls itself, which is safe
from infinite recursion due to the lazy evaluation. It doesn't
actually get evaluated unless the `atom' test fails first. One final
change is to allow for only atoms as well as expressions and then our
simple test cases work and it is completed:
\begin{verbatim}
LAZY: 'expr' ( -- parser )
'atom'
'lparens'
'atom' 'expr' <|> sp <*> &>
'rparens' <& [ >quotation ] <@ <|> ;
"(set square (lambda (n) (* n n)))" 'expr' parse car parse-result-parsed . =>
[ "set" "square" [ "lambda" [ "n" ] [ "*" "n" "n" ] ] ]
"123" 'expr' parse car parse-result-parsed . =>
123
"hello" 'expr' parse car parse-result-parsed . =>
"hello"
\end{verbatim}
These examples should work using the parser combinators code in the
current Factor darcs repository and in the upcoming 0.85 release. I'm
currently writing more documentation for the parser combinators and it
will be accessable using the standard Factor help system. The source
for this example can be found here.