Unflatten: Stream all unflattening. #377
base: 315-lower-memory-usage
Conversation
kindly commented on Feb 9, 2021:
* Use ijson
* Use openpyxl write_only mode
* Store sheet lines in an embedded btree ZODB index (#316)
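The flatten-side list above leans on ijson, which yields top-level items from a JSON array without loading the whole file. As a hedged, stdlib-only stand-in (this is not ijson's actual API, just the same incremental idea over a string):

```python
import json

def iter_top_level(text):
    """Yield objects from a JSON array one at a time, never building the
    full list in memory (a stdlib stand-in for ijson's incremental parse)."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1
    while True:
        while text[idx] in " \t\n\r,":  # skip whitespace and commas
            idx += 1
        if text[idx] == "]":
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```

ijson does the same thing against a file object, so peak memory stays proportional to one item rather than the whole document.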
* Uses openpyxl read_only mode
* Uses ZODB storage to save incoming data into buckets based on id_name in top-level objects
* Any object without id_name gets given a random key
* Runs unflatten separately on each of these buckets
* For JSON, uses jsonstreams to stream output into both the result JSON and the cell_source_map JSON. These are the only two files that are likely to be large.
* For XML, uses lxml xmlfile to stream unflattened XML data.
Memory use graphs for an 80MB JSON file: https://docs.google.com/document/d/1QuDWwBDF9uqE1K8hG47RDvyrcYPlKAOs7eNeHEmmhNY/edit?usp=sharing
Nice! I hadn't seen the
@jpmckinney that approach is interesting and fairly straightforward. For this case I needed one iterator to write to two files (result and cell_source_map) and did not want to iterate over the data again, so having a generator for a list value would not have worked. It also deals with object streaming, which I think is interesting. Just using pypy instead of cpython does raise the memory usage (for my test case to ~250MB from ~150MB) but more than doubles the speed (to ~50 sec from ~110 sec). I think the increase in memory is just pypy overhead, though, and should grow disproportionately with the size of the file. It also makes the "doing the work" section a lower proportion of the runtime: ~1/5, from about ~1/3. Without persistence but with streaming, pypy takes 35 sec/900MB and cpython 90 sec/700MB. So overall I think we should encourage pypy if speed is needed, and it most likely makes a larger difference than any set of optimizations in the code itself.
What's the difference between persisting and writing? I/O will be the ultimate bottleneck, but optimizing the rest can make a difference, like using PyPy. In my experience (and it seems from ujson's metrics), orjson is the fastest encoder. OCDS Kit uses
Persisting is getting the original data from a spreadsheet onto disk initially so that it can be sorted by top-level id (release.id or ocid in the OCDS case). We cannot stream this, as the same top-level ids exist across sheets. Yes, orjson is faster, but will not be compatible with
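The "same id across sheets" point above is the whole reason persistence is needed; a toy illustration (assumed sheet layout, not flattentool's internals):

```python
from collections import defaultdict

def group_by_id(sheets, id_name="ocid"):
    """Collect every row that shares a top-level id, across all sheets.
    Until all sheets are read, no single release is complete, so the
    grouped rows have to be persisted rather than streamed straight out."""
    grouped = defaultdict(list)
    for sheet_name, rows in sheets.items():
        for row in rows:
            grouped[row[id_name]].append((sheet_name, row))
    return grouped

# Toy data: one ocid contributed to by two different sheets.
sheets = {
    "releases": [{"ocid": "ocds-1", "date": "2021-01-01"}],
    "awards": [{"ocid": "ocds-1", "awards/0/value": 100}],
}
```

Only after grouping can each id's rows be unflattened and emitted as one object.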
Actually, I tried changing the persistence backend from zodb to leveldb and have a private branch for that. It was actually faster for both flattening and unflattening, and it used orjson to encode and decode from leveldb. leveldb also needs bytes only for its keys and values. The reason for not using this, though, was that leveldb is hard to compile for Windows and does not support an in-memory mode, and it would probably be slower if you needed to convert everything to bytes when using the normal JSON encoder.
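The bytes-only constraint mentioned above can be shown with the stdlib dbm module, which has the same property as leveldb (this is a sketch of the pattern, not the private branch's code):

```python
import dbm.dumb
import json
import os
import tempfile

# Like leveldb, dbm stores only bytes, so every value round-trips through
# encode/decode. This is where a fast encoder such as orjson (which returns
# bytes directly) pays off over json.dumps().encode().
path = os.path.join(tempfile.mkdtemp(), "kv")
with dbm.dumb.open(path, "c") as db:
    db[b"release:ocds-1"] = json.dumps({"ocid": "ocds-1"}).encode("utf-8")
    restored = json.loads(db[b"release:ocds-1"].decode("utf-8"))
```

With the stdlib encoder, the extra str-to-bytes conversion on every read and write is pure overhead, which matches the comment that leveldb would probably be slower in that configuration.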
toxml in xml_output.py can now be removed, as it's been replaced with your new functions. (I couldn't comment on it directly as it's not in the diff.)
As with the flatten PR, some high level comments could be good.
```python
index = 0

for sheet, rows in self.get_sub_sheets_lines():
    for row_numbar, row in enumerate(rows):
```
row_numbar should be row_number.
```python
@@ -179,7 +187,103 @@ def decimal_default(o):
    raise TypeError(repr(o) + " is not JSON serializable")


# This is to just to make ensure_ascii and default are correct for streaming library
```
I think this doesn't quite grammar correctly.
Possible change:
This is to just to make sure ensure_ascii and default are correct for streaming library
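For context, the pinned serializer that comment describes might look like this (the Decimal branch is my guess for illustration; only the TypeError line is visible in the diff):

```python
import functools
import json
from decimal import Decimal

def decimal_default(o):
    # Guessed body: convert Decimal to int/float; only the TypeError
    # line below actually appears in the diff hunk.
    if isinstance(o, Decimal):
        return int(o) if o == o.to_integral_value() else float(o)
    raise TypeError(repr(o) + " is not JSON serializable")

# Pin ensure_ascii and default once, so the streaming library serializes
# values exactly like the non-streaming json.dump code path.
json_dumps = functools.partial(json.dumps, ensure_ascii=False, default=decimal_default)
```

Fixing these two keyword arguments in one place keeps the streamed output byte-compatible with the existing writer.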
```diff
@@ -502,10 +587,12 @@ def fancy_unflatten(self, with_cell_source_map, with_heading_source_map):
     return result, ordered_cell_source_map, heading_source_map

-def extract_list_to_error_path(path, input):
+def extract_list_to_error_path(path, input, index=None):
```
I think this function is a bit confusing. The actual implementation is fine once I got my head around it, so maybe it just requires a bit of explanation.
```
@@ -1,24 +1,24 @@
[
    {
```
I don't like the indentation mismatch here between this opening { and the closing one. All the other examples look okay, so I think this has to do with having a list at the root.
Looks like this is a jsonstreams bug, I can reproduce with:

```python
import jsonstreams

with jsonstreams.Stream(jsonstreams.Type.array, filename='test.json', indent=4) as s:
    s.write({"a": 1, "b": 2})
```
I will report this upstream. Happy to see this merged as is.
Reported dcbaker/jsonstreams#41
123d981 to 31b9399 (compare)