Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negative lookahead with ordered choice will parse data that should be excluded #103

Open
pjstadig opened this issue Jun 9, 2015 · 11 comments
Assignees

Comments

@pjstadig
Copy link

pjstadig commented Jun 9, 2015

I believe I have come upon a bug in Instaparse. I have an example grammar at https://github.com/pjstadig/instaparse-ordered-choice

If I have a simple grammar like so:

<Symbols> := Symbol (WS+ Symbols)?

Symbol := NonDelim+

<NonDelim> := !WS #'.'

<WS> := <' '> / <','> / <'\t'> / <'\r'> / <'\n'>

and I parse the string "foo,bar" I end up with the parse ([:Symbol "f" "o" "o" "," "b" "a" "r"]) The Symbol rule should not be parsing the comma. Indeed, if I try to parse the string "foo," starting at the Symbol rule, it is rejected as invalid.

Curiously, if I swap the first two clauses of the WS rule (the space and comma clauses), then "foo,bar" will be correctly parsed as ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"] however "foo bar" gets parsed as ([:Symbol "f" "o" "o" " " "b" "a" "r'])

An additional curiosity is that with the original WS rule I get the following parses:

"foo bar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo,bar" => ([:Symbol "f" "o" "o" "," "b" "a" "r"])
"foo\tbar" => ([:Symbol "f" "o" "o" "\t" "b" "a" "r"])
"foo\rbar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo\nbar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])

I'm open to the idea that I may be doing something horribly wrong, but this Instaparse behaviour seems quite unexpected.

@pjstadig
Copy link
Author

pjstadig commented Jun 9, 2015

I should say that I'm not sure if the 'optional' operator in the Symbols rule could also be contributing. If you change the Symbols rule to Symbol WS+ Symbol then everything parses as expected:

"foo bar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo,bar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo\tbar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo\rbar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])
"foo\nbar" => ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])

@Engelberg
Copy link
Owner

I appreciate the compact example of the problem. I'm investigating.

Two observations so far:

If you produce all the parses using insta/parses, the correct parse is there, it is just the second parse. So that is encouraging -- now the question is why is it producing the seemingly bogus parse first.

Second, a far more natural way to express the Symbols rule is:

<Symbols> := Symbol (WS+ Symbol)+

which works correctly (and would be a great workaround for you in the meantime while I investigate this further).

So, my initial sense is that the recursion in Symbols is making it difficult for instaparse to sort out whether the inability to find a space character means to advance to the next item in the ordered choice, or proves the negative lookahead. Certainly the interplay of negative lookahead and ordered choice has been a tricky thing to get right. In certain recursive conditions, it can even be paradoxical to resolve, so likely that's the culprit -- instaparse thinks there's a paradox and is producing a couple possible parses to try to resolve it (but at first glance, I don't see an actual paradox here, so this may well be a bug). I don't think it is a problem with the optional operator. The reason that \r and \n aren't causing the same problems is that in Java/Clojure regexes, #'.' doesn't match newline characters, so those can never be classified as a NonDelim.

In any case, I wanted to make sure to respond to you quickly with the <Symbols> := Symbol (WS+ Symbol)+ workaround, and I'll continue to look at this. Another possible workaround would be to use the regular alternation operator for the negative lookahead whitespace, and if you need ordered choice elsewhere for a positive rule, you could create a separate rule for that.

@Engelberg
Copy link
Owner

Ah, I just noticed that your grammar would be more equivalent to:

<Symbols> := Symbol (WS+ Symbol)*

which does manifest the problem, so maybe it does have something to do with the optional nature after all. Will continue to look...

@Engelberg
Copy link
Owner

OK, I understand what is going on now. It's not actually a bug, per se, but it is a subtle difference from the way Instaparse actually treats ordered choice, versus the way it is handled in PEG parsers, which makes it counterintuitive.

Some background:
In PEG parsers, there is a very strict evaluation order (which is why PEG parsers can't handle left recursion). When creating Instaparse, I was faced with the challenge of how to interpret ordered choice in this more forgiving context handling all sorts of recursions and ambiguity.

So perhaps the best way to describe ordered choice in Instaparse is that when it encounters an ordered choice, it tries to find whether, globally, there is any possible parse consistent with the first alternative in ordered choice, before moving on to the next alternative in the ordered choice. All possible alternations are eventually produced, but the goal of ordered choice is to prioritize parses consistent with the first alternatives in the ordered list.

So, effectively, what instaparse is doing is it is taking your grammar:

<Symbols> := Symbol (WS+ Symbols)?
Symbol := NonDelim+
<NonDelim> := !WS #'.'
<WS> := <' '> / <','> / <'\t'> / <'\r'> / <'\n'>

and is asking the question "Is there any valid parse for this?:"

<Symbols> := Symbol (WS+ Symbols)?
Symbol := NonDelim+
<NonDelim> := !WS #'.'
<WS> := <' '>

for "foo,bar"

and the answer is: "Yes, absolutely there is a valid parse for this grammar." namely ([:Symbol "f" "o" "o" "," "b" "a" "r"])

Next, it asks whether there is a parse consistent with the following grammar:

<Symbols> := Symbol (WS+ Symbols)?
Symbol := NonDelim+
<NonDelim> := !WS #'.'
<WS> := <' '> | <','>

and produces ([:Symbol "f" "o" "o"] [:Symbol "b" "a" "r"])

Further explorations of the ordered choice don't yield any further results for this input, so that is the entirety of what is produced.

So instaparse is doing what I intended, producing all possible parses in an order that is consistent with the ordering given in the ordered choice.

I can see that this is counterintuitive when ordered choice is used inside a negative lookahead like this. So I'll need to think about whether there is a way of modifying my approach to ordered choice which would bring it more in line with one's intuition in this context. If you have any suggestions, I'd certainly take your input into account.

@Engelberg
Copy link
Owner

Clarification:

Above, I said it is asking the question "Is there any valid parse for this?:"

<Symbols> := Symbol (WS+ Symbols)?
Symbol := NonDelim+
<NonDelim> := !WS #'.'
<WS> := <' '>

which isn't strictly accurate, because this ordered choice alternation process for WS happens at each point in the string where it attempts to look for whitespace. So if there are two whitespace characters in the string, it might be trying the first one under the restricted assumption that whitespace is only ' ' and at the next point in the string, it has already moved on to treating it as ' ' | ','. So I was giving a simplification that illustrates what is going on in the context of a single whitespace character.

@Engelberg
Copy link
Owner

I'm starting to wonder whether ordered choice really should have any special meaning at all inside a negative lookahead. Maybe the solution to making this more intuitive is that inside a negative lookahead, ordered choice simply becomes regular alternation.

@Engelberg
Copy link
Owner

Certainly the workaround for now would be to have one whitespace rule with regular alternation to use inside of lookaheads, and one whitespace rule with ordered choice to be used where you need that preference in a non-lookahead context.

@Engelberg
Copy link
Owner

As I'm looking through the negative-lookahead code and refreshing my memory, I'm remembering that I did in fact already adjust this logic to make ordered-choice-in-negative-lookahead behave as expected, but it's getting foiled by the dual-use of in both a regular and negative-lookahead context.

@Engelberg
Copy link
Owner

In my effort to give a running commentary as I explored this, I gave a muddled impression of what is going on. You can safely ignore a lot of what I said. Here is a summary of the current state of things:

  1. The stuff I said about how ordered choice behaves is sort-of-true, but the real behavior (especially in negative lookahead contexts) is more subtle than that and is meant to behave intuitively.
  2. Therefore, this is a bug.
  3. The true root cause of the bug has to do with the way I'm trying to leverage positive results already computed when doing lookahead. The !WS is reporting the negative lookahead is satisfied because it is using the results of the non-lookahead use of WS, which is employing its ordered-choice-limited-view that is initially only seeing the first alternative.
  4. I will fix it, but I need a little hammock time to figure out the best way to fix it with minimal performance impact.

@pjstadig
Copy link
Author

pjstadig commented Jun 9, 2015

I was going to say that I've found rewriting the WS rule with plain alternation results in a correct parse, and that's probably fine since there's not really any ambiguity in the WS rule.

I think I can work around this anyway, but it's probably still worth fixing.

@Engelberg
Copy link
Owner

Agreed. Again, I appreciate your distilling it down to something that I could analyze and eventually get my head around what was going on.

@Engelberg Engelberg self-assigned this May 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants