EBNF exception symbol support #114

Open
eerohele opened this issue Nov 5, 2015 · 5 comments

@eerohele

eerohele commented Nov 5, 2015

I'd like to use Instaparse to parse XPath expressions. I have an EBNF grammar that works otherwise (well, I think so, at least), but there are two rules that don't work:

NCName          ::=     Name - (Char* ':' Char*)    /* An XML Name, minus the ":" */

And:

CommentContents ::=     (Char+ - (Char* ('(:' | ':)') Char*))

Where Char is:

Char            ::=     #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The rule for Name is a bit longer so I won't copy-paste it here, but it's available here.

When parsing my EBNF file, Instaparse throws this error:

java.lang.RuntimeException: - occurs on the right-hand side of your grammar, but not on the left

If I understand correctly, Instaparse doesn't support the EBNF exception symbol. If so, are there any plans to support it, or is my best bet to try to rewrite those rules using regular expressions? I'm just asking because the rules for Char and Name are pretty hefty, so I'm not sure what the best approach here is.

@Engelberg
Owner

There are a few different versions of the EBNF standard floating around, and whatever version I originally consulted didn't have a reference to the exception symbol, so this is the first I'm hearing about it. I just looked it up, though, so I know what you're talking about.

Making it possible to paste in standard EBNF grammars with little to no modification is certainly a goal of Instaparse, so now that I know about it, I'd like to investigate this eventually and get it included. The standard appears to severely restrict what can come after the - symbol: to avoid problematic recursion, the spec seems to require that the right-hand side expand to something simple (like an alternation of plain symbols), so I'm not sure your NCName and CommentContents examples would even fall within the spec's definition of the exception symbol.

In the meantime, it seems to me that negative lookahead should be a viable substitute for the exception symbol. Simply translate A - B to (!B) A and I think that should work.

However, you'll get the best performance if you can translate these rules into regexes.
Also, your rule for Char should probably be a regex, since Instaparse's EBNF mode only supports character ranges through regexes. (Instaparse's ABNF mode does support character ranges directly, but with a slightly different syntax; see https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md.)
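
As a rough illustration of the regex route for CommentContents, here is a minimal Java sketch. It assumes the grammar's regexes end up as ordinary java.util.regex patterns on the JVM; the class and method names are made up for the example. The idea is to consume one Char at a time, with a negative lookahead that refuses to step onto the start of "(:" or ":)".

```java
import java.util.regex.Pattern;

class CommentContentsSketch {
    // Char as one character class. Backslashes are doubled because this is
    // a Java string literal; a grammar file would use single backslashes.
    static final String CHAR =
        "[\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]";

    // One or more Chars, none of which begins "(:" or ":)".
    static final Pattern COMMENT_CONTENTS =
        Pattern.compile("(?:(?!\\(:|:\\))" + CHAR + ")+");

    // Whole-string check: true when s is a non-empty run of Chars that
    // contains neither comment delimiter.
    static boolean isCommentContents(String s) {
        return COMMENT_CONTENTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCommentContents("plain text"));     // true
        System.out.println(isCommentContents("a:b(c)"));         // true
        System.out.println(isCommentContents("outer (: inner")); // false
        System.out.println(isCommentContents(""));               // false
    }
}
```

For whole-string matching this should agree with the Char+ minus "contains a delimiter" definition. Used as a token inside a larger input, it stops just before a delimiter, which is probably what a comment rule wants, but that is worth testing against the real grammar.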

@eerohele
Author

eerohele commented Nov 5, 2015

Thanks for the quick reply!

I will give negative lookaheads a go. Regarding character ranges, I've already translated them into regexps, although I'm not quite sure whether I've got the syntax 100% right. For example, I changed Char to:

Char ::= #"\\u9"
       | #"\\uA"
       | #"\\uD"
       | #"[\\u20-\\uD7FF]"
       | #"[\\uE000-\\uFFFD]"
       | #"[\\u10000-\\u10FFFF]"

@aengelberg
Collaborator

A few things I noticed:

  1. You need \x{123456} instead of \u123456 for Unicode code points that aren't 4 digits.
  2. Make sure you know how many backslashes you want for those escape sequences. You need only ONE backslash if you're reading the grammar from a file, but you need TWO if you're writing it as a string in Clojure code.
  3. The most performant option is to combine all the char ranges into one regex.

Here is my edited version, with those three points taken into consideration (assuming you want single backslash):

Char ::= #"[\x{9}\x{A}\x{D}\x{20}-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]"
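
To sanity-check a class like this outside the grammar, a small Java sketch along the following lines may help. It assumes the regex is compiled by java.util.regex; the class and helper names are invented for the example. Note that a Java string literal, like a Clojure string, needs the doubled backslashes from point 2.

```java
import java.util.regex.Pattern;

class XmlCharCheck {
    // The combined class from above, with backslashes doubled for a string literal.
    static final Pattern CHAR = Pattern.compile(
        "[\\x{9}\\x{A}\\x{D}\\x{20}-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // The earlier five-digit attempt, kept only to show how it is misread:
    // only the first four hex digits are taken as the code point.
    static final Pattern FIVE_DIGIT = Pattern.compile("[\\u10000-\\u10FFFF]");

    // Turn a code point into a String (two chars for supplementary code points).
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static boolean isChar(int codePoint) {
        return CHAR.matcher(cp(codePoint)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isChar(0x9));      // true: tab is allowed
        System.out.println(isChar(0x1F600));  // true: supplementary code point
        System.out.println(isChar(0xFFFE));   // false: excluded by the grammar
        // The five-digit form does not cover U+10000 and matches "A" by accident.
        System.out.println(FIVE_DIGIT.matcher(cp(0x10000)).matches()); // false
        System.out.println(FIVE_DIGIT.matcher("A").matches());         // true
    }
}
```

The last two lines illustrate point 1: the four-digit escape swallows only the first four hex digits, so the remaining digit and the hyphen form an unintended range.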

@eerohele
Author

eerohele commented Nov 9, 2015

@aengelberg: Many thanks for the suggestions! Your version of Char works great, and I can use the information you provided to fix the other rules, too.

I haven't yet quite managed to wrangle CommentContents and NCName into regexps that work perfectly, but that's due to my lacking regexp-fu, not Instaparse. I'll keep working on it and post the EBNF I end up with here in case someone else finds it useful.

In the meantime, you can close this issue as far as I'm concerned, unless you want to keep it open for tracking the exception symbol issue.

@Engelberg
Owner

Glad to hear you're on the right track now. I'm going to keep the issue open for the exception symbol.
