Resumable parsing using continuation primitives #37
On further consideration, I don't think that Attoparsec-style resumption is useful enough. All it gets us is amortization of the cost of parsing itself; it gives us no usable incremental results. For example, when parsing input arrives in a number of chunks, we can parse each chunk immediately as it arrives, but we get no output at all until all chunks are parsed. All we get is a potential latency improvement over just buffering the bytestring chunks and parsing them in one go, and I find it unlikely that […]
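For reference, this is the shape of Attoparsec-style resumption. `A.parse` and the `Done`/`Partial`/`Fail` results are attoparsec's actual API; the driver loop is just an illustration of the point above: every intermediate state is `Partial`, and the single result only materializes after the last chunk.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

-- Feed chunks to an attoparsec parser one at a time. Nothing usable
-- comes out until the whole input has been consumed.
feedChunks :: A.Parser a -> [B.ByteString] -> Maybe a
feedChunks p chunks = go (A.parse p B.empty) chunks
  where
    go (A.Done _ x)   _      = Just x
    go (A.Fail _ _ _) _      = Nothing
    go (A.Partial k)  (c:cs) = go (k c) cs
    go (A.Partial k)  []     = go (k B.empty) []  -- empty chunk signals end of input
```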
I've just completed writing a JSON parser from scratch (GaloisInc/json#17) and I used […]
Only in the most basic implementation. If you wrap the results into a Stream, you can make a […]
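Roughly what I have in mind (a sketch only: `Stream` and `streamOf` are made-up names, while `A.parse` and its `Done`/`Partial`/`Fail` results are real attoparsec API). The trick is to run a parser for one *element* at a time and yield each element as soon as its bytes have arrived, instead of waiting for the whole input:

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

data Stream a = Yield a (IO (Stream a)) | EndOfStream

-- Run an element parser repeatedly over chunked input, yielding each
-- element as soon as it is complete. 'getChunk' returns B.empty once
-- the input is exhausted; attoparsec treats an empty chunk as end of
-- input, so the 'Partial' case cannot recur forever.
streamOf :: A.Parser a -> IO B.ByteString -> B.ByteString -> IO (Stream a)
streamOf p getChunk leftover = go (A.parse p leftover)
  where
    go (A.Done rest x)  = pure (Yield x (streamOf p getChunk rest))
    go (A.Fail _ _ _)   = pure EndOfStream  -- a real version would report the error
    go (A.Partial k)    = do
      chunk <- getChunk
      go (k chunk)
```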
Yes, since we can't know that the input is correct until we've traversed all of it, the only benefit of a chunked parser is the potential for lower memory and latency overheads. A strict parser will always be more performant on small inputs, but it becomes prohibitively expensive once we're into hundreds of megabytes of input data. Also, I don't think a chunked parser can share its foundation with the strict one. From my understanding there are two extra operations I would want: […]
Can you tell me more about your use cases? If we're talking about hundreds of MBs of JSON, and the output fits reasonably into memory, it should be parseable in at most a few seconds in […]. Of course, supporting input resupply is better than not supporting it, and there are use cases for it. Since my last comment here I've thought a bit more about the implementation, and one question is: do we want "one-shot" or "multi-shot" resumption (borrowing effect handler terminology)? In the former case a suspended parser can be resumed only once, because resumption is destructive. This can be implemented a lot more efficiently, simply by using a new GHC thread for the parser which blocks on an MVar when it runs out of input. The latter requires making a copy of the parser stack whenever we run out of input, so it requires continuation primitives. Do you see any plausible use case for multi-shot resumption? What it allows is essentially branching the parsing with multiple different supplied inputs. I've never seen applications for this, to be honest.
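A minimal sketch of the one-shot scheme (all names below are hypothetical scaffolding, not flatparse API): the parser runs on its own GHC thread and blocks on an MVar whenever its buffer runs dry, so "resuming" is just filling the MVar, and it is inherently destructive.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import qualified Data.ByteString.Char8 as B

-- One-shot resumption via a worker thread.
data Suspended a = Suspended
  { supply :: B.ByteString -> IO ()  -- feed the next chunk (destructive)
  , result :: MVar a                 -- filled when the parser finishes
  }

-- The parser is abstracted as any action that calls its first argument
-- whenever it needs more input; that call blocks the parser thread.
suspendable :: (IO B.ByteString -> IO a) -> IO (Suspended a)
suspendable parser = do
  input  <- newEmptyMVar
  output <- newEmptyMVar
  _ <- forkIO (parser (takeMVar input) >>= putMVar output)
  pure (Suspended (putMVar input) output)

-- Usage: each chunk resumes the suspended parser exactly once.
demo :: IO ()
demo = do
  s <- suspendable $ \more -> do
    a <- more
    b <- more
    pure (B.length a + B.length b)
  supply s (B.pack "first chunk")
  supply s (B.pack "second chunk")
  print =<< takeMVar (result s)
```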
I don't quite understand this point. I think any chunked/resumable parser would keep track of the required chunks just by virtue of them being live data and not being GC'd. On a backtracking […]
The only real one I know of that assumes chunking by default is […]
Two distinct problems here: […]
I'm out of my depth here, but after a bit of reading on this I think this question is just "what if we take […]". Input chunks, in this specific case, are always constant once discovered, so supporting an operation like this is entirely optional. This sounds "one-shot", but I don't know if running the parser against an […]
My implementation already requires […]

I can't say anything on the performance front, because my main gripe with […]
Now that I look at the list of restrictions […]

Me saying "I would use […]
Sounds like we need […]
Chunked parsing is definitely not off the table, nor is comprehensive bit-width and endianness support. I actually think we're not far from 32-bit and big-endian support on host machines; it's just that so far no one has cared about it. The current features are based purely on what users want to have, although it is true that the current users want high performance. Now that @BurningWitness has expressed a desire for it, there's a better chance of chunked parsing being added. Technically, I don't think there's any issue. Basically, we need to convert the end-of-buffer pointer and the […]
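To illustrate the direction (a conceptual sketch only, not flatparse's actual internals, which use unboxed `Addr#`s and a different result type): a reader primitive in the strict parser fails when the cursor hits the end-of-buffer pointer, while a chunked variant would suspend there instead and accept a fresh buffer on resumption.

```haskell
import Data.Word (Word8)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek)

data Res a
  = Ok a (Ptr Word8)                                 -- value + advanced cursor
  | Fail
  | Suspend (Ptr Word8 -> Ptr Word8 -> IO (Res a))   -- resume with new eob/cursor

-- Read one byte, given the end-of-buffer pointer and the cursor.
byte :: Ptr Word8 -> Ptr Word8 -> IO (Res Word8)
byte eob p
  | p < eob   = do w <- peek p; pure (Ok w (p `plusPtr` 1))
  | otherwise = pure (Suspend byte)  -- out of input: suspend instead of failing
```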
We already provide native byte order machine integer parsers. They should only require a small CPP addition to also support non-64-bit word size systems. (I didn't pursue it because all I needed were explicit size & endianness parsers.)
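Something along these lines, presumably (a hedged sketch: `Parser` and the two fixed-size readers are toy stand-ins so the fragment compiles on its own; `WORD_SIZE_IN_BITS` really does come from GHC's `MachDeps.h`):

```haskell
{-# LANGUAGE CPP #-}
#include "MachDeps.h"

import Data.Word (Word32, Word64)

-- Toy stand-ins, not the library's definitions.
type Parser a = String -> Maybe a

anyWord64 :: Parser Word64
anyWord64 _ = Just 0  -- placeholder

anyWord32 :: Parser Word32
anyWord32 _ = Just 0  -- placeholder

-- The actual point: pick the native 'Word' reader at compile time
-- based on the host word size reported by MachDeps.h.
anyWord :: Parser Word
#if WORD_SIZE_IN_BITS == 64
anyWord s = fromIntegral <$> anyWord64 s
#else
anyWord s = fromIntegral <$> anyWord32 s
#endif
```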
Turns out I was wrong: Haskell does have a streaming parser in that sweet spot between […]. From looking at it, […]
So far, resumable parsers have really only been feasible with CPS-based internals, as in Attoparsec. Unfortunately, that has painful overhead, because it effectively moves the control stack of the parser to the heap, and continuation closure calls are also slower than native GHC returns.
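For concreteness, a simplified version of the CPS shape in question (attoparsec's real internals carry more state, but the essence is the same): the parser's "return address" is an explicit success-continuation closure, so the control stack becomes a chain of heap closures, which makes suspension trivial and ordinary running slow.

```haskell
{-# LANGUAGE RankNTypes #-}
import qualified Data.ByteString as B
import Data.Word (Word8)

data Result r
  = Done r
  | Failed
  | NeedMore (B.ByteString -> Result r)  -- suspension: the captured continuation

newtype Parser a = Parser
  { runParser
      :: forall r.
         B.ByteString                        -- remaining input
      -> (B.ByteString -> a -> Result r)     -- success continuation (heap closure)
      -> Result r                            -- failure result
      -> Result r
  }

-- Suspending is just packaging up the continuations we already have.
anyByte :: Parser Word8
anyByte = Parser $ \inp ok bad ->
  case B.uncons inp of
    Just (w, rest) -> ok rest w
    Nothing        -> NeedMore (\chunk -> runParser anyByte chunk ok bad)
```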
GHC 9.6.x is now available in release candidates, and it has new primops for delimited continuations: https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0313-delimited-continuation-primops.rst It might be interesting to investigate the new primitives. The idea would be to save the current parser continuation when running out of input, instead of throwing an error.
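A sketch of what that could look like on top of the new primops (the IO wrappers follow the proposal; `Step`, `askChunk`, and `runChunked` are hypothetical names): `control0#` captures the parser's continuation when input runs out, and the driver resumes it with the next chunk.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import qualified Data.ByteString as B
import GHC.Exts (PromptTag#, newPromptTag#, prompt#, control0#)
import GHC.IO (IO (..))

-- Thin IO wrappers over the GHC 9.6 delimited continuation primops.
data PromptTag a = PromptTag (PromptTag# a)

newPromptTag :: IO (PromptTag a)
newPromptTag = IO $ \s -> case newPromptTag# s of
  (# s', tag #) -> (# s', PromptTag tag #)

prompt :: PromptTag a -> IO a -> IO a
prompt (PromptTag tag) (IO m) = IO (prompt# tag m)

control0 :: PromptTag a -> ((IO b -> IO a) -> IO a) -> IO b
control0 (PromptTag tag) f =
  IO (control0# tag (\k -> case f (\(IO m) -> IO (k m)) of IO st -> st))

-- Hypothetical use: a parser step that, instead of erroring at end of
-- input, captures its own continuation and hands it to the driver.
data Step a = Done a | More (B.ByteString -> IO (Step a))

askChunk :: PromptTag (Step a) -> IO B.ByteString
askChunk tag = control0 tag $ \k ->
  -- Reinstall the prompt when resuming, so a later suspension still
  -- has a prompt to abort to.
  pure (More (\chunk -> prompt tag (k (pure chunk))))

runChunked :: (IO B.ByteString -> IO a) -> IO (Step a)
runChunked parser = do
  tag <- newPromptTag
  prompt tag (Done <$> parser (askChunk tag))
```

A driver would then pattern-match on `Step`: on `More`, fetch the next chunk and apply the stored continuation; on `Done`, hand back the result. Since `control0#` copies the captured stack slice to the heap, this is also the route that would make multi-shot resumption conceivable.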