diff --git a/parser-ng/src/parser/mod.rs b/parser-ng/src/parser/mod.rs
index ca8b3e7d7..a5b8d89ee 100644
--- a/parser-ng/src/parser/mod.rs
+++ b/parser-ng/src/parser/mod.rs
@@ -61,12 +61,83 @@ impl<'src> Parser<'src> {
 /// Internal implementation of the parser. The [`Parser`] type is only a
 /// wrapper around this type.
 struct InternalParser<'src> {
+    /// Stream from which the parser consumes the input tokens.
     tokens: TokenStream<'src>,
+
+    /// Stream where the parser puts the events that make up the resulting CST.
     output: SyntaxStream,
+
+    /// If true, the parser is in "failure" state. The parser enters the "failure"
+    /// state when some syntax rule expects a token that doesn't match the
+    /// next token in the input.
+    failed: bool,
+
+    /// How deep the parser is into "optional" branches of the grammar. An
+    /// optional branch is one that can fail without the whole production
+    /// rule failing. For instance, in `A := B? C` the parser can fail while
+    /// parsing `B`, but this failure is acceptable because `B` is optional.
+    /// Less obvious cases of optional branches are present in alternatives
+    /// and the "zero or more" operation (examples: `(A|B)`, `A*`).
+    opt_depth: usize,
+
+    /// Errors found during parsing that haven't been sent to the `output`
+    /// stream yet.
+    ///
+    /// When the parser expects a token, and that token is not the next one
+    /// in the input, it produces an error like `expecting "foo", found "bar"`.
+    /// However, these errors are not sent immediately to the `output` stream
+    /// because some of the errors may occur while parsing optional code, or
+    /// while parsing some branch in an alternation. For instance, in the grammar
+    /// rule `A := (B | C)`, if the parser finds an error while parsing `B`,
+    /// but `C` succeeds, then `A` is successful and the error found while
+    /// parsing `B` is not reported.
+    ///
+    /// On the other hand, if both `B` and `C` produce errors, then `A` has
+    /// failed, but only one of the two errors is reported. The error that
+    /// gets reported is the one that advanced further in the source code (i.e.,
+    /// the one with the largest span start). This approach tends to produce
+    /// more meaningful errors.
+    ///
+    /// The items in the vector are error messages accompanied by the span in
+    /// the source code where the error occurred.
     pending_errors: Vec<(String, Span)>,
+
+    /// Hash map where keys are positions within the source code, and values
+    /// are lists of tokens that were expected to match at that position.
+    ///
+    /// This hash map plays a crucial role in error reporting during parsing.
+    /// Consider the following grammar rule:
+    ///
+    /// `A := a? b`
+    ///
+    /// Here, the optional token `a` must be followed by the token `b`. This
+    /// can be represented (conceptually, not actual code) as:
+    ///
+    /// ```text
+    /// self.start(A)
+    ///     .opt(|p| p.expect(a))
+    ///     .expect(b)
+    ///     .end()
+    /// ```
+    ///
+    /// If we attempt to parse the sequence `cb`, it will fail at `c` because
+    /// the rule matches only `ab` and `b`. The error message should be:
+    ///
+    /// "expecting `a` or `b`, found `c`"
+    ///
+    /// This error is generated by the `expect(b)` statement. However, the
+    /// `expect` function only knows about the `b` token. So, how do we know
+    /// that both `a` and `b` are valid tokens at the position where `c` was
+    /// found?
+    ///
+    /// This is where the `expected_tokens` hash map comes into play. We know
+    /// that `a` is also a valid alternative because the `expect(a)` inside the
+    /// `opt` was tried and failed. The parser doesn't fail at that point
+    /// because `a` is optional, but it records that `a` was expected at the
+    /// position of `c`. When `expect(b)` fails later, the parser looks up
+    /// any other tokens (besides `b`) that were expected to match at that
+    /// position and produces a comprehensive error message.
     expected_tokens: HashMap>,
-    opt_depth: usize,
-    failed: bool,
 }
 
 impl<'src> From> for InternalParser<'src> {
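
Reviewer note: the `expected_tokens` mechanism described in the doc comment can be illustrated with a small standalone sketch. Everything below is hypothetical and not taken from the patch: `MiniParser`, its single-character tokens, the `error` field and `parse_rule_a` are stand-ins chosen for brevity, and the real `InternalParser` may record positions and pending errors quite differently. The sketch only shows how recording failed expectations per position lets a later `expect` report every token that was valid there, using the rule `A := a? b` from the comment.

```rust
use std::collections::HashMap;

/// Hypothetical miniature parser. Tokens are single characters and
/// positions are indices into `tokens`; the real parser uses richer types.
struct MiniParser {
    tokens: Vec<char>,
    pos: usize,
    failed: bool,
    /// Position -> tokens that were expected (and did not match) there.
    expected_tokens: HashMap<usize, Vec<char>>,
    /// Last error produced; cleared when an optional branch fails.
    error: Option<String>,
}

impl MiniParser {
    fn new(input: &str) -> Self {
        MiniParser {
            tokens: input.chars().collect(),
            pos: 0,
            failed: false,
            expected_tokens: HashMap::new(),
            error: None,
        }
    }

    /// Consumes `tok` if it is the next token. Otherwise records that `tok`
    /// was expected at this position and enters the failure state.
    fn expect(&mut self, tok: char) -> &mut Self {
        if self.failed {
            return self;
        }
        if self.tokens.get(self.pos) == Some(&tok) {
            self.pos += 1;
            return self;
        }
        let expected = self.expected_tokens.entry(self.pos).or_default();
        if !expected.contains(&tok) {
            expected.push(tok);
        }
        // The message lists every token tried at this position so far,
        // including those tried inside optional branches that failed.
        let list = expected
            .iter()
            .map(|t| format!("`{t}`"))
            .collect::<Vec<_>>()
            .join(" or ");
        let found = match self.tokens.get(self.pos) {
            Some(t) => format!("`{t}`"),
            None => "end of input".to_string(),
        };
        self.error = Some(format!("expecting {list}, found {found}"));
        self.failed = true;
        self
    }

    /// Runs `f` as an optional branch: a failure inside it is discarded and
    /// the parser backtracks, but `expected_tokens` keeps what was tried.
    fn opt(&mut self, f: impl FnOnce(&mut Self)) -> &mut Self {
        if self.failed {
            return self;
        }
        let start = self.pos;
        f(self);
        if self.failed {
            self.failed = false;
            self.pos = start;
            self.error = None;
        }
        self
    }
}

/// Parses the rule `A := a? b` from the doc comment.
fn parse_rule_a(input: &str) -> Result<(), String> {
    let mut p = MiniParser::new(input);
    p.opt(|p| {
        p.expect('a');
    })
    .expect('b');
    match p.error.take() {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

In this sketch, `parse_rule_a("ab")` and `parse_rule_a("b")` succeed, while `parse_rule_a("cb")` fails with a message that names both `a` and `b`, because the failed `expect('a')` inside `opt` left its entry in the map at position 0.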