Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating strings with Instaparse #82

Open
zmaril opened this issue Sep 27, 2014 · 16 comments
Open

Generating strings with Instaparse #82

zmaril opened this issue Sep 27, 2014 · 16 comments

Comments

@zmaril
Copy link

zmaril commented Sep 27, 2014

I'm interested in generating strings with instaparse. By this, I mean I'd like to give instaparse a data structure to fill in and a grammar and have it return to me strings that could parse to something that would parse to the given data structure. We use instaparse a good deal in ECHELON and I'd like to use test.check to generate data for testing. I'm still learning how the QuickCheck style libraries work, but I think if we could generate strings via instaparse we could use test.check without too much trouble. I'm happy to develop this myself (and will continue to explore further), but I wanted to check in first to see if it is possible (or has already been quietly done).

For clarification, I'd like a function that does something akin to the following:

(def grammar
  (insta/parser
   "S = name ' ' business
    name = 'Steve' | 'Mary' | 'Bob'
    business = 'Bakery' | 'Shop'"))

(grammar "Steve Shop")
(def grammar
  (insta/parser
   "S = name ' ' business
    name = 'Steve' | 'Mary' | 'Bob'
    business = 'Bakery' | 'Shop'"))

(grammar "Steve Shop")

(insta/generate [:S [:name] [:business]])
;;;[:S [:name "Steve"] " " [:business "Shop"]]

(insta/generate [:S [:name] [:business]])
;;["Steve Bakery" "Steve Shop" "Mary Bakery" "Mary Shop" "Bob Bakery" "Bob Shop"]

I suppose I'd like to give instaparse a partial data structure corresponding to a "skeleton" of a parse tree and have it generate data to fill in the rest of the tree. I haven't thought through how this would play with transformations and the hiding of tags in instaparse yet. Does this seem possible? Is it a dumb thing to ask for?

@zmaril
Copy link
Author

zmaril commented Oct 2, 2014

I've started work along these lines. I think I've covered most of the opts needed to generate all the strings from a simple ENBF. Ordered choice looks like no problem (same as alt) and +/- lookahead looks possible with a filter (just need to figure out the right way to create the appropriate predicate). The only really tricky part that seems left is generating strings based on regular expression. I think writing a compiler that takes regular expressions and turns them into combinatiors from instaparse, the same as was done in cfg.clj and a nbf.clj, should be enough to generate strings (assuming there would be no new opts to cover).

@aengelberg
Copy link
Collaborator

Hi Zack,

This is a really cool idea. Ever since your initial post several days ago
I've been working on a way to generalize this type of problem. It's not
unlike the code you just presented, but instead it uses core.logic. I'm
excited to show it to you once it's polished. Probably sometime this
weekend.

Thanks!
--Alex

On Wed, Oct 1, 2014 at 6:11 PM, Citizen [email protected] wrote:

I've started work along these lines
zmaril@6cff000.
I think I've covered most of the opts needed to generate all the strings
from a simple ENBF. Ordered choice looks like no problem (same as alt) and
+/- lookahead looks possible with a filter (just need to figure out the
right way to create the appropriate predicate). The only really tricky part
that seems left is generating strings based on regular expression. I think
writing a compiler that takes regular expressions and turns them into
combinatiors from instaparse, the same as was done in cfg.clj and a
nbf.clj, should be enough to generate strings (assuming there would be no
new opts to cover).


Reply to this email directly or view it on GitHub
#82 (comment).

@zmaril
Copy link
Author

zmaril commented Oct 3, 2014

Excellent! When I was doing my initial research into generating strings the
only resources I actually came across suggested prolog, so I'm excited to
see what you come up with!

On Fri, Oct 3, 2014 at 12:01 AM, aengelberg [email protected]
wrote:

Hi Zack,

This is a really cool idea. Ever since your initial post several days ago
I've been working on a way to generalize this type of problem. It's not
unlike the code you just presented, but instead it uses core.logic. I'm
excited to show it to you once it's polished. Probably sometime this
weekend.

Thanks!
--Alex

On Wed, Oct 1, 2014 at 6:11 PM, Citizen [email protected] wrote:

I've started work along these lines
<
zmaril@6cff00065bfc4618c7b1be558cc8bb74ab922975>.

I think I've covered most of the opts needed to generate all the strings
from a simple ENBF. Ordered choice looks like no problem (same as alt)
and
+/- lookahead looks possible with a filter (just need to figure out the
right way to create the appropriate predicate). The only really tricky
part
that seems left is generating strings based on regular expression. I
think
writing a compiler that takes regular expressions and turns them into
combinatiors from instaparse, the same as was done in cfg.clj and a
nbf.clj, should be enough to generate strings (assuming there would be
no
new opts to cover).


Reply to this email directly or view it on GitHub
#82 (comment).


Reply to this email directly or view it on GitHub
#82 (comment).

@aengelberg
Copy link
Collaborator

https://github.com/aengelberg/instagenerate

Presenting "instagenerate", a core.logic implementation of instaparse that
can be used to generate inputs to parsers, and more. I think this has the
potential to be very useful.

--Alex

On Fri, Oct 3, 2014 at 6:01 AM, Citizen [email protected] wrote:

Excellent! When I was doing my initial research into generating strings
the
only resources I actually came across suggested prolog, so I'm excited to
see what you come up with!

On Fri, Oct 3, 2014 at 12:01 AM, aengelberg [email protected]
wrote:

Hi Zack,

This is a really cool idea. Ever since your initial post several days
ago
I've been working on a way to generalize this type of problem. It's not
unlike the code you just presented, but instead it uses core.logic. I'm
excited to show it to you once it's polished. Probably sometime this
weekend.

Thanks!
--Alex

On Wed, Oct 1, 2014 at 6:11 PM, Citizen [email protected]
wrote:

I've started work along these lines
<

zmaril@6cff00065bfc4618c7b1be558cc8bb74ab922975>.

I think I've covered most of the opts needed to generate all the
strings
from a simple ENBF. Ordered choice looks like no problem (same as alt)
and
+/- lookahead looks possible with a filter (just need to figure out
the
right way to create the appropriate predicate). The only really tricky
part
that seems left is generating strings based on regular expression. I
think
writing a compiler that takes regular expressions and turns them into
combinatiors from instaparse, the same as was done in cfg.clj and a
nbf.clj, should be enough to generate strings (assuming there would be
no
new opts to cover).


Reply to this email directly or view it on GitHub
<
https://github.com/Engelberg/instaparse/issues/82#issuecomment-57566426>.


Reply to this email directly or view it on GitHub
#82 (comment).


Reply to this email directly or view it on GitHub
#82 (comment).

@zmaril
Copy link
Author

zmaril commented Oct 5, 2014

This is nice. I've played with it some and it does what it says! It does an interesting number of things as I look closer. I don't know core.logic well enough to understand this, but it looks like it could complete partial strings of input as well. As in, you gave it both a partial input and a partial output and it could infer the results for both. Very neat!

I think support for regex (and the jump to more general usefulness) has to be inside instaparse proper. Which, stepping back for a second, is an interesting project by itself. Combining regular expressions and core.logic (via instagenerate) seems like it could be an interesting crossroads.

@aengelberg
Copy link
Collaborator

I've thought about regular expressions a bit, and one thing I'm not sure
how to handle is greedy vs non-greedy regular expressions. In instaparse,
"#'a+' 'a'+" will always parse via the regex first, never getting to
the individual strings (so the parse tree would contain "aaaaa", not
multiple instances of "a"). However, I'm afraid that a naive implementation
in core.logic could return multiple results that distribute the "a"s in
multiple ways.

On Sunday, October 5, 2014, Citizen [email protected] wrote:

This is nice. I've played with it some and it does what it says! It does
an interesting number of things as I look closer. I don't know core.logic
well enough to understand this, but it looks like it could complete partial
strings of input as well. As in, you gave it both a partial input and a
partial output and it could infer the results for both. Very neat!

I think support for regex (and the jump to more general usefulness) has to
be inside instaparse proper. Which, stepping back for a second, is an
interesting project by itself. Combining regular expressions and core.logic
(via instagenerate) seems like it could be an interesting crossroads.


Reply to this email directly or view it on GitHub
#82 (comment).

--Alex

@zmaril
Copy link
Author

zmaril commented Oct 6, 2014

Infinity is something that has been perplexing me with regards to string generation. It's possible to get stuck on a * or something similar and never really get to any other part of the parse tree. Although it's not quite the difference between depth first and breadth first search, it feels similar conceptually. For my application at least, I'm more interested in getting a variety of strings early on and then hitting infinity, as opposed to learning that there is an part of my parser that generates an infinite string first and never getting that variety of strings.

I wonder, do you think it is possible to constrain the length of the string results and return them ordered by length? Like, yes we do have infinite strings, but we will only return the strings of length one then two then three and so on, in the spirit of iterative deepening depth first search. This seems possible with core.logic. If you add in the constraint that the string must be of length N and do something like:

(->> (iterate inc 1) (map (partial generate-possible-strings-of-length-n parser)) flatten)

@gfredericks
Copy link

I coincidentally came across this and thought I would mention that I've been working on a decently-robust test.check generator which takes a regex and generates matching strings -- I'm hoping to release it in this library in a day or two.

It's occurred to me several times that a similar problem is generating strings from grammars, which was of course the original question in this thread.

One thing to keep in mind wrt test.check in particular is that you get the most value when your generators are created using the built-in combinators so that you can take advantage of shrinking. My assumption is that core.logic approaches don't go this route.

@kanaka
Copy link

kanaka commented Oct 1, 2018

As part of research I'm working on for my PhD (browser render testing), I recently created instacheck which is basically instaparse meets test.check. It allows you to define EBNF grammars (instaparse compatible) which are translated into test.check generators. It has some command line modes for generating samples, running test cases/inputs against a test program (with shrinking), and also generating the translated clj generator code for a grammar. It even supports regex in the grammar thanks to the awesome test.chuck from @gfredericks. I just got back from presenting the project at Strange Loop. I think videos should post within the next week.

One critical feature that instacheck supports is the ability to adjust the weights that are used for alternation/choice points in the grammar. This is super useful and probably critical for a lot of applications (although this wasn't obvious until I started using it in anger).

Note that for the moment, instacheck requires patched version of test.check and instaparse. The test.check change to to get the frequency generator to shrink towards the more common weights rather than shrinking to whatever happens to be defined first. I should be able to get rid of this requirement by defining my own frequency generator that basically does weight sorting prior to calling the underlying test.check frequency generator. The changes I have to instaparse are to support the ability to tune the grammar weights: #183, #184. The first allows weights to be configured via comments embedded directly in the grammar. The second patch enables logging of the paths that are taken through a grammar. This allows you to tune weights according to existing test cases by parsing them with the grammar.

@zmaril
Copy link
Author

zmaril commented Oct 2, 2018

Wow! That's awesome to see. Way to go, that's seriously cool, thank you for posting about it on this thread.

@Engelberg
Copy link
Owner

Great work! Now that I understand the purpose of the pull requests, it will be easier to evaluate those. Thanks for this.

@lvh
Copy link

lvh commented May 31, 2020

I would like to generate a specific string from a speciifc parse tree (given a grammar, of course). Is this the right ticket for that?

@gfredericks
Copy link

@lvh I'm not sure what you mean by that; you already have the string? and the parse tree? what "generation" is there left to do? maybe an example would help

@lvh
Copy link

lvh commented May 31, 2020

@gfredericks Right, the idea is you'd either generate a parse tree from scratch, or more likely modify a parse tree and then reserialize. So you can produce a semantic patch () but render it as a diff, for example.

@gfredericks
Copy link

are you just saying you want to unparse a tree? for some grammars there would be more than a single valid output

@lvh
Copy link

lvh commented Apr 20, 2022

Sorry, lost track of this. Yes, that's what I'm saying. I'd be happy with any output; also happy to make the grammar capture more details of the input file to make the output as close as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants