EOF start rules #3763

kaby76 · 2022-06-26T20:33:21Z

There was an interesting bug found a week ago in the Scala grammar. The grammar results in a parser that apparently parses input fine, but yields an incomplete parse tree, and doesn't produce any error messages. This is because the parser stops on an error, and returns a parse tree up to the point where the parse worked.

Why does a parser do this? This is because the start rule does not end with the EOF symbol (which I call an "EOF start rule"). When the start rule is changed (compilationUnit : ('package' qualId)* topStatSeq ; => compilationUnit : (('package' qualId)* topStatSeq) EOF ;), an error is raised. The grammar has errors, which were not found because the parse was valid on partial input.
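A minimal sketch of the effect, in plain Python rather than ANTLR. The toy rule sum : INT ('+' INT)* and the names tokenize, parse_sum and require_eof are hypothetical and exist only for illustration. The same rule is run once as a start rule without EOF and once as if it were written sum EOF.

```python
import re

def tokenize(text):
    # Toy lexer: runs of digits or any single non-space character.
    # Whitespace is dropped and an explicit end marker is appended.
    return re.findall(r"\d+|\S", text) + ["<EOF>"]

def parse_sum(tokens, require_eof):
    """Parse  sum : INT ('+' INT)*  and return (matched_tokens, error).

    require_eof=False mimics a start rule without EOF: the rule stops at
    the first token that cannot extend the sum and reports nothing.
    require_eof=True mimics  sum EOF : leftover input becomes an error.
    """
    pos = 0
    matched = []
    if not tokens[pos].isdigit():
        return matched, f"expected INT at token {pos}"
    matched.append(tokens[pos])
    pos += 1
    # "<EOF>" is always the last token, so tokens[pos + 1] exists
    # whenever tokens[pos] is "+".
    while tokens[pos] == "+" and tokens[pos + 1].isdigit():
        matched += [tokens[pos], tokens[pos + 1]]
        pos += 2
    if require_eof and tokens[pos] != "<EOF>":
        return matched, f"unexpected {tokens[pos]!r} at token {pos}"
    return matched, None

tokens = tokenize("1 + 2 * 3")
print(parse_sum(tokens, require_eof=False))  # partial match, no error reported
print(parse_sum(tokens, require_eof=True))   # same match, plus an error
```

Both calls match the same prefix. Only the EOF-terminated variant reports that the rest of the input was never consumed.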

What other grammars in grammars-v4 do not have "an EOF start rule"? Most. When I add an EOF start rule, a dozen grammars that used to "work" no longer do.

agp
asm/asm8086
asn/asn
bcpl
databank
html
inf
lark
metamath

What is my point? Requiring explicit EOF start rules is a problem because developers will forget to add an EOF start rule. Adjusting the grammars in grammars-v4 is burdensome, as much as for "symbol conflicts" resolution and "case insensitive lexing" that were recently fixed in Antlr 4.10. Already, my PR for EOF start rules in grammars-v4 changes 158 files, and it still does not fix the dozen or so grammars with errors.

There are two solutions to force the parser to read to completion of the input:

1. Force the developer to include an EOF start rule. Update all grammars in the grammars-v4 repository to have an EOF-terminated start rule. I.e., if the original start rule given in pom.xml is start : s1 s2 ... sn ; then change the rule to start : (s1 s2 ... sn) EOF ;, or introduce a new rule that references the old start rule symbol, i.e. start : s1 s2 ... sn; => new_start : start EOF;

The problem with this method is that people sometimes start the parse on a rule other than the one listed in the pom.xml, yet have the same expectation of a parse that consumes the entire input. For example, parsing expressions using the C++ grammar.

It's possible to use automated means (via Trash) to edit the grammar to have an EOF-terminated start rule. This may work for CI builds, but developers will still need to modify the grammar for their needs. Considering many are naive developers, they may not augment the grammar with an EOF start rule.

It's likely possible to test whether a grammar needs an EOF start rule by checking whether a suffix of the start rule's RHS can derive empty (start : s1 s2 ... sm sn so ... sz; and sm...sz =>+ empty).

2. Provide a new switch to the Antlr runtime for parsing two ways: one in which the input is consumed to the end of file; and a second type where it accepts a parse on partial input.

This modification would mean a change to the runtime. But, most developers get what they expect, with no modifications to the grammar.
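The "can a suffix of the start rule derive empty" test mentioned under the first solution can be sketched with the standard nullable-symbol fixed point. This is one reading of that test, not a Trash or Antlr feature. The rule table below is a hypothetical grammar written as plain BNF, so Antlr's * and ? operators would have to be desugared into helper rules first.

```python
def nullable_symbols(rules):
    """rules maps a nonterminal to a list of alternatives; each alternative
    is a list of symbols. Symbols that are not keys of `rules` are terminals.
    Returns the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in rules.items():
            if name in nullable:
                continue
            # A rule is nullable if every symbol of some alternative is
            # nullable (an empty alternative qualifies trivially).
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(name)
                changed = True
    return nullable

def has_nullable_suffix(rules, start):
    """True if some alternative of `start` is empty or ends in a nullable
    symbol, i.e. a non-empty suffix of its RHS can derive empty."""
    nullable = nullable_symbols(rules)
    return any(not alt or alt[-1] in nullable for alt in rules[start])

# Hypothetical grammar: start : header items ;  where items may be empty.
rules = {
    "start": [["header", "items"]],
    "header": [["'package'", "ID"]],
    "items": [[], ["item", "items"]],
    "item": [["ID", "';'"]],
}
print(has_nullable_suffix(rules, "start"))
```

A start rule flagged this way can end early without the parser having anything left that it is required to match, so it is a candidate for an explicit EOF.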

Any thoughts?

mtoy-googly-moogly · 2022-06-27T16:13:50Z

Interesting, I also had this problem when writing tests. I just wanted, for example, to parse just an expression. I had to create a rule testableExpression: expression EOF to get those tests to work.

What if instead there was a flag, passed when creating the parse tree by calling parserObject.startRuleName(), which would allow or disallow a parse that ends on something other than EOF?

Oh, I see that was choice 2. Yeah, I like choice 2, and I think it should be per parse, not at parser generation time.

0 replies

KvanTTT · 2022-06-27T19:21:46Z

> Provide a new switch to the Antlr runtime for parsing two ways: one in which the input is consumed to the end of file; and a second type where it accepts a parse on partial input.
>
> This modification would mean a change to the runtime. But, most developers get what they expect, with no modifications to the grammar.

I think this is definitely the correct solution to the problem. I encountered the problem some time ago when I had to test different rules instead of the root rule and was forced to write code that added EOF tokens artificially. It was dirty. Also, I'd say the "consuming up to EOF" mode should be the default.

5 replies

kaby76 Jun 27, 2022
Author

Glad to hear. The question then becomes what should we do with the grammars in grammars-v4? Add EOF start rules to all grammars, add only to the ones that are failing and fix the grammars, or wait for a version of the runtime that supports this, or ... something else?

KvanTTT Jun 27, 2022

I think it depends on @parrt's answer. If he accepts the runtime solution, we can wait for an implementation. For now, it makes sense to fix only the failing grammars because that looks quite urgent.

By the way, it looks like with the new "consuming to the end" mode, there is no need for an explicit EOF token at all, at least in the grammars repository.

kaby76 Jun 27, 2022
Author

Correct. I'll change only to fix parsing issues, but not include an EOF start rule on the fixed grammars in the PR. If/when the feature is implemented, I can edit the grammars in one easy foreach loop with Trash to remove the EOF if it's explicit.

kaby76 Jun 30, 2022
Author

If we do add the feature to the Antlr runtime, it would be good if the trees did not contain the EOF token. I noticed that manually adding the EOF to the end of the start rule causes the tree to contain the EOF leaf node. This then requires remastering a bunch of .tree files over in grammars-v4. Ideally, I'd rather not adjust the tree files.

KvanTTT Jun 30, 2022

That looks reasonable. The tree should not contain the EOF token, because it's an artificial token.

parrt · 2022-06-30T20:59:16Z

Maintainer

Hmm... I'm not liking these options I don't think. Also, I've often used the ability to parse the first part of a file with a rule. Adding EOF is just not a burden for me but apparently should be better documented haha!!

7 replies

parrt Jun 30, 2022
Maintainer

Why not make a Parser subclass that checks whether the input stream is at end of file for your grammar tests?

kaby76 Jun 30, 2022
Author

I thought of that, but it doesn't actually push the lexer and parser to output recognizer errors. It is a hack to force the developer to investigate further.

Incidentally, there are 24 or so grammars that are broken out of 340. So, that's only about 7% that are broken (where if forced to parse to EOF, errors would be output from lexer or parser in the grammar test suite). It's not as many as I thought, but I did have to change 217 files just to find the broken grammars, not to fix any of them mind you.

KvanTTT Jul 1, 2022

It's often useful to have such a mode. I worked at @Swiftify-Corp (ObjectiveC -> Swift code converter) some time ago, and I ran into the problem of using different rules as roots (translationUnit, expression, compoundStatement). Converting only part of the code, not the entire file, is quite a relevant task.

It's possible to check whether the input stream is at the end of a file, but that doesn't look like a clean solution: the user expects the entire code fragment to be parsed, even if it contains errors, not just a small part of it. The only way to do this now is to add the artificial EOF token, but the more natural way is to handle it in the runtime.

kaby76 Jul 1, 2022
Author

Just checking that the EOF is reached after a parse is not sufficient because one cannot distinguish between good and bad input:

Error parsing scala file of object definitions (object body was not parsed without any error messages) grammars-v4#2668: The input is invalid. The parser does not set an error and does not output an error, but it's not at the end of file.
grammar A; start: 'A' ; WS: [ ]+->skip; with input 'A ': The input is valid and no error is raised (nor should one be), yet the parser is also not at the end of file.

It's absolutely not sufficient. But, it might be sufficient to say that if there are any tokens on the default channel after the parse tree interval, then there was likely an error.
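The "tokens on the default channel after the parse tree interval" heuristic can be sketched as follows. This is plain Python with stand-in types, not the Antlr runtime API: Tok and unparsed_default_tokens are hypothetical names, and stop_index stands for the index of the last token covered by the parse tree. The constants follow the Antlr convention of token type -1 for EOF and channel 0 for the default channel.

```python
from dataclasses import dataclass

DEFAULT_CHANNEL = 0   # channel the parser reads from
HIDDEN_CHANNEL = 1    # e.g. comments or whitespace kept off the parser's channel
EOF_TYPE = -1         # token type of the end-of-input marker

@dataclass
class Tok:
    type: int
    channel: int = DEFAULT_CHANNEL

def unparsed_default_tokens(tokens, stop_index):
    """Indices of non-EOF tokens on the default channel that lie after the
    last token covered by the parse tree. A non-empty result suggests the
    parse stopped early."""
    return [
        i
        for i in range(stop_index + 1, len(tokens))
        if tokens[i].channel == DEFAULT_CHANNEL and tokens[i].type != EOF_TYPE
    ]

# Input 'A ' with WS skipped: only A and EOF reach the token stream.
valid = [Tok(1), Tok(EOF_TYPE)]
# The parse covered token 0, but another default-channel token follows.
suspicious = [Tok(1), Tok(2), Tok(EOF_TYPE)]
print(unparsed_default_tokens(valid, stop_index=0))
print(unparsed_default_tokens(suspicious, stop_index=0))
```

Unlike a bare "is the stream at EOF" check, skipped or hidden-channel tokens after the tree do not count here, which is what separates the two cases listed above.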

I'll check a subclassed Parser or otherwise trgen method to push the parse to the EOF. But, I am busy correcting 24 messed up grammars.

parrt Jul 2, 2022
Maintainer

For better or worse, it was a conscious decision to allow parsing in the middle of a stream or as a prefix of a stream; I can't really change it now without breaking backward compatibility. I don't see it as an emergency in that it just really has not come up much over the past few decades. Naturally it is the most common case to parse the entire stream, so I just add the EOF token. I wonder if that's in the book somewhere. Yep, an entire section called "Start Rules and EOF"; page 270 of pdf.
