
Implement from_reader deserialization without loading all content in memory #549

Open · gwen-lg opened this issue Oct 18, 2024 · 2 comments

gwen-lg commented Oct 18, 2024

The purpose of using a reader is to avoid having to load (or wait for) all of the content before processing can start.

At the moment, a call to de::from_reader calls read_to_end on the reader and fills a Vec<u8> with the entire content.
This can be bad, especially with large files.
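
Roughly, the eager behaviour described above amounts to something like this (a minimal sketch for illustration, not ron's actual source):

```rust
use std::io::Read;

// Minimal sketch of the eager behaviour: buffer the entire input
// before any parsing can begin.
fn from_reader_eager<R: Read>(mut reader: R) -> std::io::Result<Vec<u8>> {
    let mut bytes = Vec::new();
    // Blocks until EOF: the whole document is allocated up front
    // before deserialization even starts.
    reader.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```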

I have taken a look at the serde_json crate: they have a better-suited implementation.
Would it be possible to take their implementation as inspiration?
Or even to extract the part of the code that manages Read, so it can be shared (if there is nothing JSON-specific in it)?

I would be interested in trying to implement it, but I prefer to ask before starting, in case there is already work underway on this, or a preference for how it should be implemented.

juntyr (Member) commented Oct 19, 2024

I think my immediate answer would be yes and no.

All low-level deserialization is handled by the Parser struct, which internally manages the full &str; that string is only accessed in four lines of code. So it might be relatively easy to instead allow the Parser to manage a potentially incomplete buffered string that is read in further whenever more content is required, as in the sketch below.
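
A minimal sketch of that idea, with hypothetical names (BufferedParser, fill) that are not ron's actual internals:

```rust
use std::io::Read;

// Illustrative sketch: a parser buffer that is grown on demand
// instead of holding the complete input from the start.
struct BufferedParser<R: Read> {
    reader: R,
    buf: Vec<u8>,  // bytes read so far, possibly incomplete
    cursor: usize, // next byte the parser will consume
}

impl<R: Read> BufferedParser<R> {
    /// Ensure at least `n` unread bytes are buffered, pulling more
    /// from the underlying reader only when needed.
    fn fill(&mut self, n: usize) -> std::io::Result<bool> {
        let mut chunk = [0u8; 4096];
        while self.buf.len() - self.cursor < n {
            let read = self.reader.read(&mut chunk)?;
            if read == 0 {
                return Ok(false); // EOF before `n` bytes were available
            }
            self.buf.extend_from_slice(&chunk[..read]);
        }
        Ok(true)
    }

    /// Peek at the next byte without consuming it.
    fn peek(&mut self) -> std::io::Result<Option<u8>> {
        Ok(if self.fill(1)? {
            Some(self.buf[self.cursor])
        } else {
            None
        })
    }

    /// Consume and return the next byte.
    fn next(&mut self) -> std::io::Result<Option<u8>> {
        let byte = self.peek()?;
        if byte.is_some() {
            self.cursor += 1;
        }
        Ok(byte)
    }
}
```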

On the other hand, the parser currently has to support (almost) arbitrary backtracking (in the worst case across the entire document), so at some point it needs to have read in all of the content. The backtracking is required for deserialize_any, where RON sometimes needs to peek quite far ahead to see whether we're parsing a struct, a tuple, etc.
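
For example, both of the documents below begin with `(`, so a self-describing deserialization (e.g. into ron::Value, which goes through deserialize_any) has to scan past the opening parenthesis, and in nested documents potentially much further, before it knows what it is building:

```rust
// Illustrative only: the parser cannot tell these two apart from the
// opening `(` alone; it must look ahead for an `ident:` pattern.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tuple: ron::Value = ron::from_str("(1, 2, 3)")?; // sequence-like
    let record: ron::Value = ron::from_str("(x: 1, y: 2)")?; // map-like
    println!("{tuple:?}\n{record:?}");
    Ok(())
}
```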

Since we currently need the full RON string in memory at some point to support the worst case, reading it all in at the beginning simplifies the code (and is faster than reading it in bit by bit).

If we really want to support streaming (where possible; again, there are cases where we cannot avoid needing the full document in memory, but that worst case would only be triggered by a user of RON), the first step would be to establish a second cursor in the Parser that tracks the lower bound of the content that we still need to access. Once we had both this new lower bound (never read anything earlier than this again) and the current upper bound (read in more when we try to read beyond it), the parser's inner state could become a buffer that only tracks the bytes in between those two. In most cases, both cursors would be the same, and RON would be able to support proper streaming. A sketch of this scheme follows below.
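
A hypothetical sketch of the two-cursor scheme (all names are illustrative, not ron's actual internals):

```rust
use std::io::Read;

// Illustrative sketch: the buffer only retains the window between the
// committed lower bound and the lazily-extended upper bound.
struct StreamingParser<R: Read> {
    reader: R,
    buf: Vec<u8>, // holds only the bytes in the window `low..high`
    low: usize,   // absolute offset: never re-read anything before this
    high: usize,  // absolute offset: one past the last byte fetched so far
    pos: usize,   // absolute offset of the current parse position
}

impl<R: Read> StreamingParser<R> {
    /// Pull one more chunk from the reader, extending the upper bound.
    fn fetch(&mut self) -> std::io::Result<usize> {
        let mut chunk = [0u8; 4096];
        let n = self.reader.read(&mut chunk)?;
        self.buf.extend_from_slice(&chunk[..n]);
        self.high += n;
        Ok(n)
    }

    /// Random access anywhere at or above `low`, fetching lazily
    /// whenever the requested offset lies beyond `high`.
    fn byte_at(&mut self, abs: usize) -> std::io::Result<Option<u8>> {
        assert!(abs >= self.low, "cannot backtrack below the lower bound");
        while abs >= self.high {
            if self.fetch()? == 0 {
                return Ok(None); // reached EOF before `abs`
            }
        }
        Ok(Some(self.buf[abs - self.low]))
    }

    /// Advance the lower bound to the current position and discard the
    /// bytes that can no longer be revisited.
    fn commit(&mut self) {
        self.buf.drain(..self.pos - self.low);
        self.low = self.pos;
    }
}
```

In the common case the parser would commit as soon as a value is fully consumed, so low == pos almost always and the buffer stays small; a deserialize_any lookahead simply leaves low pinned until the ambiguity is resolved.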

I'd be really happy to review your contributions towards this!

gwen-lg (Author) commented Oct 22, 2024

Thank you @juntyr.
I'm keeping this on my to-do list and will try to do some testing on this shortly.
