
Implement from_reader deserialization without loading all content in memory #549

Open · gwen-lg opened this issue Oct 18, 2024 · 2 comments

gwen-lg commented Oct 18, 2024

The purpose of using a reader is to avoid having to load (or wait for) all of the content before processing can start.

At the moment, a call to de::from_reader calls read_to_end on the reader and fills a Vec<u8> with the entire content.
This can be bad, especially with large files.
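
Roughly, the eager behaviour described above amounts to something like this (a minimal sketch for illustration, not ron's actual source):

```rust
use std::io::Read;

// Minimal sketch of the eager behaviour: buffer the entire input
// before any parsing can begin.
fn from_reader_eager<R: Read>(mut reader: R) -> std::io::Result<Vec<u8>> {
    let mut bytes = Vec::new();
    // Blocks until EOF: the whole document is allocated up front
    // before deserialization even starts.
    reader.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```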

I have taken a look at the serde_json crate: they have a better-suited implementation.
Would it be possible to take their implementation as inspiration?
Or even to extract the part of the code that manages Read, so it can be shared (if there is nothing JSON-specific in it)?

I would be interested in trying to implement it, but I prefer to ask before starting, in case there is already work underway on this, or a preference for how it should be implemented.

juntyr (Member) commented Oct 19, 2024

I think my immediate answer would be yes and no.

All low-level deserialization is handled by the Parser struct, which internally manages the full &str; that string is only accessed in four lines of code. So it might be relatively easy to instead allow the Parser to manage a potentially incomplete buffered string that is read in further whenever more content is required, as in the sketch below.
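
A minimal sketch of that idea, with hypothetical names (BufferedParser, fill) that are not ron's actual internals:

```rust
use std::io::Read;

// Illustrative sketch: a parser buffer that is grown on demand
// instead of holding the complete input from the start.
struct BufferedParser<R: Read> {
    reader: R,
    buf: Vec<u8>,  // bytes read so far, possibly incomplete
    cursor: usize, // next byte the parser will consume
}

impl<R: Read> BufferedParser<R> {
    /// Ensure at least `n` unread bytes are buffered, pulling more
    /// from the underlying reader only when needed.
    fn fill(&mut self, n: usize) -> std::io::Result<bool> {
        let mut chunk = [0u8; 4096];
        while self.buf.len() - self.cursor < n {
            let read = self.reader.read(&mut chunk)?;
            if read == 0 {
                return Ok(false); // EOF before `n` bytes were available
            }
            self.buf.extend_from_slice(&chunk[..read]);
        }
        Ok(true)
    }

    /// Peek at the next byte without consuming it.
    fn peek(&mut self) -> std::io::Result<Option<u8>> {
        Ok(if self.fill(1)? {
            Some(self.buf[self.cursor])
        } else {
            None
        })
    }

    /// Consume and return the next byte.
    fn next(&mut self) -> std::io::Result<Option<u8>> {
        let byte = self.peek()?;
        if byte.is_some() {
            self.cursor += 1;
        }
        Ok(byte)
    }
}
```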

On the other hand, the parser currently has to support (almost) arbitrary backtracking (in the worst case across the entire document), so at some point it needs to have read in all of the content. The backtracking is required for deserialize_any, where RON sometimes needs to peek quite far ahead to see whether we're parsing a struct, a tuple, etc.
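
For example, both of the documents below begin with `(`, so a self-describing deserialization (e.g. into ron::Value, which goes through deserialize_any) has to scan past the opening parenthesis, and in nested documents potentially much further, before it knows what it is building:

```rust
// Illustrative only: the parser cannot tell these two apart from the
// opening `(` alone; it must look ahead for an `ident:` pattern.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tuple: ron::Value = ron::from_str("(1, 2, 3)")?; // sequence-like
    let record: ron::Value = ron::from_str("(x: 1, y: 2)")?; // map-like
    println!("{tuple:?}\n{record:?}");
    Ok(())
}
```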

Since we currently need the full RON string in memory at some point to support the worst case, reading it all in at the beginning simplifies the code (and is faster than reading it in bit by bit).

If we really want to support streaming (where possible; again, there are cases where we cannot avoid needing the full document in memory, but that worst case would only be triggered by a user of RON), the first step would be to establish a second cursor in the Parser that tracks the lower bound of the content that we still need to access. Once we had both this new lower bound (never read anything earlier than this again) and the current upper bound (read in more when we try to read beyond it), the parser's inner state could become a buffer that only tracks the bytes in between those two. In most cases, both cursors would be the same, and RON would be able to support proper streaming. A sketch of this scheme follows below.
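
A hypothetical sketch of the two-cursor scheme (all names are illustrative, not ron's actual internals):

```rust
use std::io::Read;

// Illustrative sketch: the buffer only retains the window between the
// committed lower bound and the lazily-extended upper bound.
struct StreamingParser<R: Read> {
    reader: R,
    buf: Vec<u8>, // holds only the bytes in the window `low..high`
    low: usize,   // absolute offset: never re-read anything before this
    high: usize,  // absolute offset: one past the last byte fetched so far
    pos: usize,   // absolute offset of the current parse position
}

impl<R: Read> StreamingParser<R> {
    /// Pull one more chunk from the reader, extending the upper bound.
    fn fetch(&mut self) -> std::io::Result<usize> {
        let mut chunk = [0u8; 4096];
        let n = self.reader.read(&mut chunk)?;
        self.buf.extend_from_slice(&chunk[..n]);
        self.high += n;
        Ok(n)
    }

    /// Random access anywhere at or above `low`, fetching lazily
    /// whenever the requested offset lies beyond `high`.
    fn byte_at(&mut self, abs: usize) -> std::io::Result<Option<u8>> {
        assert!(abs >= self.low, "cannot backtrack below the lower bound");
        while abs >= self.high {
            if self.fetch()? == 0 {
                return Ok(None); // reached EOF before `abs`
            }
        }
        Ok(Some(self.buf[abs - self.low]))
    }

    /// Advance the lower bound to the current position and discard the
    /// bytes that can no longer be revisited.
    fn commit(&mut self) {
        self.buf.drain(..self.pos - self.low);
        self.low = self.pos;
    }
}
```

In the common case the parser would commit as soon as a value is fully consumed, so low == pos almost always and the buffer stays small; a deserialize_any lookahead simply leaves low pinned until the ambiguity is resolved.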

I'd be really happy to review your contributions towards this!

gwen-lg (Author) commented Oct 22, 2024

Thank you @juntyr.
I'm keeping this on my to-do list and will try to do some testing on this shortly.
