TODO merge in notes from CS164

TODO: distinguish grammars from parsers more cleanly in these notes
compiler types
- recursive descent
  - parses LL(k)
  - simplest to hand-code; simply recursive functions that backtrack
  - non-predictve; requires backtracking
  - most embedded parser languages are this (e.g. haskell's parsec monadic parser), unless the embedded language constructs the data structures fed into a "more separate" parsing
- predictive grammars
  - use $k$ tokens of lookahead to fully determine which production to use
  - no backtracking needed
- top-down operator precedence (pratt) parser
  - http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
  - basic idea: allow yourself to call into main parse(precedence) from anywhere, and the precedence is always checked to be non-decreasing
  - also controls associativity
  - pseudocode
    
    def parse(precedence): token = next() p = prefix_parsers[token.type] parsed = p.parse(token) while True: token = next() p = infix_parsers[token.type] if p is None or precedence < token.precedence: break parsed = p.parse(parsed, token) return parsed
parsing types
- shift-reduce parsing: a type of bottom-up parsing
grammar types (hierarchy)
- bottom-up
  - LR
    - LR(0)
    - SLR(1!)
    - LALR(1)
    - LR(1)
    - LR(n)
grammar types (details)
- general
  - first L means left-to-right scanning
  - second L/R means left-to-right/right-to-left derivation (leftmost/rightmost)
  - (k) = k-token lookahead
- LL(k) aka top-down: simplest
  - produces leftmost derivation top-down
  - LL(1): need to look ahead 1 token to determine which rule to apply (for infix, etc.)
  - animation: http://www.cs.wm.edu/~noonan/animations/llparse.html
  - ANTLR
  - LL is a subset of LR
- non-commutative LL(*): most hand-written parsers, combinators
  - memoized non-commutative LL(*): packrat parsers, scala 2.8 parser combinators
  - harder theoretical analysis vs. LR, but probably weaker
- LR(k)
  - k is the number of tokens to lookahead at the end of a rule
  - produces rightmost derivation in reverse bottom-up TODO
  - don't get used much; internal tables tend to get too big
  - handles left-recursion
- LALR(k)
  - YACC, bison
- SLR(k)
  - LR(0) + followsets TODO
grammar forms
- Backus-Naur Form (BNF): TODO
- Extended BNF (EBNF): TODO
determining ambiguity: uncomputable
- eg "dangling else" in C/C++/Java; convention is attach to nearest
  - if SLR/LR(1)/LALR LR, rely on generated parser feature of preferring shift over reduce if conflict
- LR (and thus LL) grammars are always non-ambiguous
- unsure: LR parsers can identify certain non-ambiguous grammars, but they have this gray area where they can't tell whether the grammar is ambiguous
- but at least GLR can notify you of this when it happens
parsing expression grammar (PEG)
- recursive descent (top-down) parsers: no left recursion
- packrat parsers
  - a memoizing parsers; use much more space TODO how much?
  - linear time (unlike GLR)
  - rewriting extension can handle left recursion
- no ambiguity because alternative productions are ordered
  - that's also a downside: can't tell when you have ambiguous grammar
- all LL grammars are PEGs but not vice versa
code generation
- division is slow integer operation
  - gcc (among others) uses trickery involving non-div instrs, but only works for constant divisors
  - http://research.swtch.com/2008/01/division-via-multiplication.html
superoptimizers: exhaustively search for truly optimal generated code sequence
language classes: low to high power
- regular: regexes; DFAs/NFAs
- context-free: don't need symbol table
  - python is context-free but req's tokenizer that recognizes (de-)indents
- context-sensitive: need symbol table; eg in C++, is T*x a type decl or a mult? ("most vexing parse")

Tools

parser generators
- lex/yacc: old and busted; LALR
- flex/bison: new hotness; LALR
  - generates C
  - does GLR
  - flex includes flex++ for C++; bison++ also exists
- javacc: most popular for java
- alex/happy: haskell
- ANTLR: LL(k) parser generator
edsl parsers
- parsec: haskell
- scala
boost spirit 2: GLR

Grammars

terminals: loosely, can be used interchangeably with "token"
non-terminals
start symbol
productions

Derivations

sequence of production steps proving that a string belongs in the language
2 choices at each step:
- which nonterminal to expand? leftmost vs. rightmost
- which production to use for that nonterminal?
e.g.
- grammar: E -> (E) | -E | id | E+E
- a derivation: E => -E => -(E) => -(id)
- leftmost derivation: E => E+E => id+E => id+id
- rightmost derivation: E => E+E => E+id => id+id

Grammar issues

ambiguity
left recursion
- hidden left recursion: even tomita's GLR can't handle

Grammar transformations

eliminating ambiguity
eliminating left recursion
left factoring: gets you a grammar suitable for top-down parsing

Top-down parsing

recursive descent: requires backtracking
LL: predictive, with $k$ tokens of lookahead
- tables of FIRST and FOLLOW sets
top-down parsing during left-to-right scan constructs a leftmost derivation

Bottom-up parsing

shift-reduce parsers
recursive-ascent: implementation technique for bottom-up (LR) parsers
- practical benefit: flexible in that it allows for parameterized recursion (eg for error recovery)
- also, popular bc simple
LR: largest class of grammars for which shift-reduce parser can be built
- shift-reduce conflicts; ambiguous grammar cannot be LR
- LR parsing is attractive because
  - it can recognize almost all CFGs
  - LR parser can detect a syntactic error ASAP on left-to-right scan
  - most general nonbacktracking shift-reduce parsing method known, but efficient
  - main drawback is difficulty of implementation
bottom-up parsing during left-to-right scan constructs a rightmost derivation in reverse
LALR: lookahead LR
- more compact than LR, like SLR
- more general than SLR
Excerpts from http://www.devincook.com/GOLDParser/articles/glossary.htm
- LALR Parsing, or “Lookahead LR parsing”, is a variant of LR Parsing that combines different “configurations” to limit the size of the parse tables. As a result, the algorithm is slightly less powerful than the LR Parsing. Grammars that can be parsed by a LR parser, might be found to be “ambiguous” by LALR. However, this is very rarely the case and real-world examples are few.The number of states eliminated by LALR are sometimes huge. The C programming language, for instance, has over 10,000 LR states. LALR drops this number to slightly more than 200.
- LL Parsing, or Left-to-right Left-derivative parsing, refers to a class of parsers that analyze text using a top-down algorithm.
  
  LL Parsers have the advantage of being very simple in the design - at least conceptually. Unlike LR Parsers, the system does not need to generate tables ahead of time. Instead, only a set of rules and lookahead data is needed.
  
  However, LL Parsers are not as powerful as LR parsers. Rules cannot have left-recursion since it causes problems with the recursive decent algorithm. In addition, LL Parsers are known as memory intensive and slow. Both of these are due to the "search" nature of the algorithm. Optimization is possible, but the complexity of the LL Parser runtime engine increases dramatically.
- LR Parsing, or Left-to-right Right-derivative parsing, uses tables to determine when a rule is complete and when additional tokens are needed to be read from the source. Unlike LL Parsing, the LR Parser does very little "thinking" at runtime. All decisions are based on the content of the parse tables. As a result, LR Parsing is faster and more memory efficient than LL Parsing.
  
  The construction of these tables where all the "thinking" takes place. LR parser generators, such as YACC and GOLD, construct these tables by analyzing the grammar and determining all the possible "states" the system can be in when parsing.
  
  Each state represents a point in the parse process where a number of tokens have been read from the source and rules are in different states of completion. Each production in a state of completion is called a "configuration" and each state is really a configuration set.
  
  LR parse tables can be huge, and, as a result, often a variant of LR Parsing is used. For more information, please see theLALR Algorithm page.
- A Reduce-Reduce Conflict is a caused when a grammar allows two or more different rules to be reduced at the same time, for the same token. When this happens, the grammar becomes ambiguous since a program can be interpreted more than one way.
  
  For instance, assume you have the following grammar: ~~::= |~~
  ::= Identifier ::= Identifier
  When the system reads an identifier, it cannot determine if it has completed or.
  
  When a LALR parser generator is analyzing a grammar and constructing the parse tables, these conflicts are located immediately.

compiler optimizations

http://lambda-the-ultimate.org/node/3674

const objects are treated as constants (vs const refs where constness can be cast away)

macros/inline unnecessary; compiler tries to inline anyway

reduce conditionals; change to arithmetic

in-CPU call stack makes regular function calls fast too

common subexpression elimination

invariant hoisting moves loop-invariant checks out of the loop

inlining lets compiler do variable value range analysis

auto vectorization: words at a time or 128-bit vector at a time

bit shifts and leas

tail recursion (gcc)

dead code elimination

things that don't matter: pre-/post-inc, data structures/algos (vs. simple cache-friendly arrays)

memory hierarchy is the only important optimization today

mult instead of shift: 5 cycles

branch mispredict: 10 cycles

cache miss to main memory: 250 cycles

lexing/regular expression matching

2 approaches: backtracking and parallel

backtracking: depth-first search; used in Perl, Python, etc

sucks for a?{n}a{n} since it may need to redo the a?{n} many times

may take exp-time

perl uses memoization to avoid exp'l blow-up, but requires $O(nm)$ space, where $n$ is len of regex and $m$ is len of str

parallel aka Thompson NFA: breadth-tracking search; track all possible states simultaneously

when building DFAs out of NFAs, this is the approach used

sed, awk, grep, tcl, lexers build DFAs

sucks (exp-time) for wide-branching cases? can't think of them right now

backreferences trivial in backtracking, but make parallel exp-time

http://swtch.com/~rsc/regexp/regexp1.html

lexer hack: lexer doesn't have enough info to determine if symbol is var name or type name, eg (int) *x vs (a) * b; must feed it symbol table info

tracing JITs

tracing vs per-method JIT

tracing: tracemonkey

method: nitro (safari), v8 (but v8 guesses by looking at src)

tracing faster on numeric but slower on object-manip

http://lambda-the-ultimate.org/node/3851

type-specializing vs non-type-specializing JIT

not yet any real per-method type-specializing JITs

hard to fully specialize bc these only check arg types, making input-specialized versions, but not dealing w internal types (eg getting obj props)

tracemonkey: only loops that iterate at least 3x

type specialization is biggest fruit; interp is less

branch specialization: cap at 32 traces to avoid using too much mem

nanojit: the actual compiler

22x on numerical; 2-5x on object manip

stages: monitoring (loop counting), recording, compiling, native execution

inline caching

speed up dynamic method invocation by caching previous method binding directly at call site

type check ('guard') still required, usu in callee's preamble

monomorphic: one type cached; else, revert to slow path

polymorphic: which of a limited set of methods to invoke; multiple bindings can be recorded at same call site

megamorphic: if see too many types; same as uninitialized but never mono-/polymorphic again

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compilers.page

Compilers.page

Files

Compilers.page

Latest commit

History

Compilers.page

File metadata and controls