Negative lookahead with ordered choice will parse data that should be excluded #103
Comments
I should say that I'm not sure if the 'optional' operator in the Symbols rule could also be contributing. If you change the Symbols rule to
I appreciate the compact example of the problem. I'm investigating. Two observations so far: If you produce all the parses using Second, a far more natural way to express the Symbols rule is:
which works correctly (and would be a great workaround for you in the meantime while I investigate this further). So, my initial sense is that the recursion in Symbols is making it difficult for instaparse to sort out whether the inability to find a space character means to advance to the next item in the ordered choice, or proves the negative lookahead. Certainly the interplay of negative lookahead and ordered choice has been a tricky thing to get right. In certain recursive conditions, it can even be paradoxical to resolve, so likely that's the culprit -- instaparse thinks there's a paradox and is producing a couple possible parses to try to resolve it (but at first glance, I don't see an actual paradox here, so this may well be a bug). I don't think it is a problem with the optional operator. The reason that In any case, I wanted to make sure to respond to you quickly with the
Ah, I just noticed that your grammar would be more equivalent to:
which does manifest the problem, so maybe it does have something to do with the optional nature after all. Will continue to look...
OK, I understand what is going on now. It's not actually a bug, per se, but it is a subtle difference from the way Instaparse actually treats ordered choice, versus the way it is handled in PEG parsers, which makes it counterintuitive. Some background: So perhaps the best way to describe ordered choice in Instaparse is that when it encounters an ordered choice, it tries to find whether, globally, there is any possible parse consistent with the first alternative in ordered choice, before moving on to the next alternative in the ordered choice. All possible alternations are eventually produced, but the goal of ordered choice is to prioritize parses consistent with the first alternatives in the ordered list. So, effectively, what instaparse is doing is it is taking your grammar:
and is asking the question "Is there any valid parse for this?:"
for "foo,bar" and the answer is: "Yes, absolutely there is a valid parse for this grammar." namely Next, it asks whether there is a parse consistent with the following grammar:
and produces Further explorations of the ordered choice don't yield any further results for this input, so that is the entirety of what is produced. So instaparse is doing what I intended, producing all possible parses in an order that is consistent with the ordering given in the ordered choice. I can see that this is counterintuitive when ordered choice is used inside a negative lookahead like this. So I'll need to think about whether there is a way of modifying my approach to ordered choice which would bring it more in line with one's intuition in this context. If you have any suggestions, I'd certainly take your input into account.
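To make the contrast with PEG-style ordered choice concrete, here is a toy Python model. It is only a sketch: it is not Instaparse's actual algorithm, and the grammar `('a' / 'ab') 'c'` is a hypothetical one chosen for illustration, not the grammar from this issue. The helper names (`committed_choice`, `prioritized_choice`, `matches`) are made up for this example. `committed_choice` commits to the first alternative that matches locally, as a PEG parser does, while `prioritized_choice` keeps every alternative available and merely tries the earlier ones first, which is closer to the "is there any parse, globally, consistent with the first alternative?" behaviour described above.

```python
from typing import Callable, Iterator

# A matcher takes (text, pos) and yields every end position it can reach,
# in priority order.
Matcher = Callable[[str, int], Iterator[int]]


def lit(s: str) -> Matcher:
    """Match the fixed string s."""
    def m(text: str, pos: int) -> Iterator[int]:
        if text.startswith(s, pos):
            yield pos + len(s)
    return m


def seq(*parts: Matcher) -> Matcher:
    """Match all parts in order, backtracking over each part's results."""
    def m(text: str, pos: int) -> Iterator[int]:
        def go(i: int, p: int) -> Iterator[int]:
            if i == len(parts):
                yield p
                return
            for q in parts[i](text, p):
                yield from go(i + 1, q)
        yield from go(0, pos)
    return m


def prioritized_choice(*alts: Matcher) -> Matcher:
    """Exhaustive choice: offer every alternative, earlier ones first."""
    def m(text: str, pos: int) -> Iterator[int]:
        for alt in alts:
            yield from alt(text, pos)
    return m


def committed_choice(*alts: Matcher) -> Matcher:
    """PEG-style choice: commit to the first alternative that matches here."""
    def m(text: str, pos: int) -> Iterator[int]:
        for alt in alts:
            first = next(alt(text, pos), None)
            if first is not None:
                yield first
                return
    return m


def matches(m: Matcher, text: str) -> bool:
    """True if some result of m consumes the whole text."""
    return any(end == len(text) for end in m(text, 0))


# Hypothetical grammar: ('a' / 'ab') 'c'
peg_like = seq(committed_choice(lit("a"), lit("ab")), lit("c"))
exhaustive = seq(prioritized_choice(lit("a"), lit("ab")), lit("c"))

print(matches(peg_like, "abc"))    # False: committed to 'a', then 'c' fails
print(matches(exhaustive, "abc"))  # True: falls back to 'ab', then 'c'
```

In this model the committed version rejects "abc" because the choice never revisits `'ab'` once `'a'` has matched, whereas the exhaustive version only uses the ordering to decide which parses to look for first.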
Clarification: Above, I said it is asking the question "Is there any valid parse for this?:"
which isn't strictly accurate, because this ordered choice alternation process for WS happens at each point in the string where it attempts to look for whitespace. So if there are two whitespace characters in the string, it might be trying the first one under the restricted assumption that whitespace is only
I'm starting to wonder whether ordered choice really should have any special meaning at all inside a negative lookahead. Maybe the solution to making this more intuitive is that inside a negative lookahead, ordered choice simply becomes regular alternation.
Certainly the workaround for now would be to have one whitespace rule with regular alternation to use inside of lookaheads, and one whitespace rule with ordered choice to be used where you need that preference in a non-lookahead context.
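The intuition that ordering should not matter inside a negative lookahead can be sketched with a toy Python matcher. This is an illustration under stated assumptions, not Instaparse code: the helper names are invented, and the whitespace rule (a space or a comma) is a hypothetical stand-in for the WS rule in the issue. A negative lookahead only asks whether its inner rule matches at all, so in this model swapping the alternatives cannot change the outcome.

```python
def literal(s):
    """Matcher for a fixed string: returns the end position, or None."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match


def ordered_choice(*alts):
    """Return the result of the first alternative that matches."""
    def match(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return match


def negative_lookahead(inner):
    """Succeed without consuming input exactly when `inner` fails here."""
    def match(text, pos):
        return pos if inner(text, pos) is None else None
    return match


def split_symbols(text, ws):
    """Split text into symbols; a symbol character is any character
    at a position where the negative lookahead !ws succeeds."""
    not_ws = negative_lookahead(ws)
    symbols, current, pos = [], "", 0
    while pos < len(text):
        if not_ws(text, pos) is not None:
            current += text[pos]
            pos += 1
        else:
            if current:
                symbols.append(current)
                current = ""
            pos = ws(text, pos)  # ws matched here, so this advances
    if current:
        symbols.append(current)
    return symbols


# Hypothetical whitespace rules, differing only in the order of alternatives.
space_first = ordered_choice(literal(" "), literal(","))
comma_first = ordered_choice(literal(","), literal(" "))

for ws in (space_first, comma_first):
    print(split_symbols("foo,bar", ws), split_symbols("foo bar", ws))
# Both orderings print: ['foo', 'bar'] ['foo', 'bar']
```

Because the lookahead reduces the choice to a yes/no answer, both orderings split "foo,bar" and "foo bar" the same way here, which is the behaviour one would expect from treating ordered choice as plain alternation in that context.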
As I'm looking through the negative-lookahead code and refreshing my memory, I'm remembering that I did in fact already adjust this logic to make ordered-choice-in-negative-lookahead behave as expected, but it's getting foiled by the dual-use of in both a regular and negative-lookahead context.
In my effort to give a running commentary as I explored this, I gave a muddled impression of what is going on. You can safely ignore a lot of what I said. Here is a summary of the current state of things:
I was going to say that I've found rewriting the WS rule with plain alternation results in a correct parse, and that's probably fine since there's not really any ambiguity in the WS rule. I think I can work around this anyway, but it's probably still worth fixing.
Agreed. Again, I appreciate your distilling it down to something that I could analyze and eventually get my head around what was going on.
I believe I have come upon a bug in Instaparse. I have an example grammar at https://github.com/pjstadig/instaparse-ordered-choice
If I have a simple grammar like so:
and I parse the string "foo,bar" I end up with the parse
([:Symbol "f" "o" "o" "," "b" "a" "r"])
The Symbol rule should not be parsing the comma. Indeed, if I try to parse the string "foo," starting at the Symbol rule, it is rejected as invalid. Curiously, if I swap the first two clauses of the WS rule (the space and comma clauses), then "foo,bar" will be correctly parsed as
([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
however "foo bar" gets parsed as
([:Symbol "f" "o" "o" " " "b" "a" "r"])
An additional curiosity is that with the original WS rule I get the following parses:
I'm open to the idea that I may be doing something horribly wrong, but this Instaparse behaviour seems quite unexpected.