Unflatten: Stream all unflattening. #377

kindly · 2021-02-09T16:05:19Z

Unflatten:  Stream all unflattening.

* Uses openpyxl read_only mode
* Uses zodb storage to save incoming data into buckets based on id_name
  in top level objects.
* Any object without id_name is given a random key.
* Runs unflatten separately on all these buckets.
* For JSON use jsonstreams to stream output into both result JSON and
  cell_source_map JSON. These are the only two files that are likely to
  be large.
* For XML use lxml xmlfile to stream unflattened xmldata.
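
The bucketing idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses a plain in-memory dict where the PR uses ZODB storage, and `bucket_rows`, `unflatten_buckets` and the merge-by-update step are hypothetical stand-ins for the real unflatten logic.

```python
import uuid
from collections import defaultdict


def bucket_rows(rows, id_name="id"):
    """Group flattened rows by their top-level identifier.

    Rows that lack the identifier get a random key, so each of them
    ends up alone in its own bucket.
    """
    buckets = defaultdict(list)
    for row in rows:
        key = row.get(id_name)
        if key is None:
            key = uuid.uuid4().hex  # random key for objects without an id
        buckets[key].append(row)
    return buckets


def unflatten_buckets(buckets):
    """Yield one combined object per bucket, processing one bucket at a time.

    The dict.update merge is only a placeholder for real unflattening.
    """
    for rows in buckets.values():
        combined = {}
        for row in rows:
            combined.update(row)
        yield combined
```

Because each bucket is processed independently, only one bucket's worth of unflattened data needs to be held at a time before it is streamed to the output.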

kindly · 2021-02-09T16:08:20Z

Memory use graphs for 80MB JSON file.

https://docs.google.com/document/d/1QuDWwBDF9uqE1K8hG47RDvyrcYPlKAOs7eNeHEmmhNY/edit?usp=sharing

jpmckinney · 2021-02-09T22:39:09Z

Nice! I hadn't seen the jsonstreams library before. In OCDS Kit, we use json.iterencode, generators and a custom encoder: https://github.com/open-contracting/ocdskit/blob/master/ocdskit/util.py#L24
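
For readers unfamiliar with the `iterencode` plus generator approach, here is a minimal sketch using only the standard library. It is not OCDS Kit's actual code (see the link above for that); `StreamedList`, `iter_json` and `squares` are hypothetical names used for illustration.

```python
import json


class StreamedList(list):
    """A list subclass that lets json's encoder consume a generator lazily."""

    def __init__(self, iterable):
        super().__init__()
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def __len__(self):
        # Report a non-zero length so the encoder does not emit "[]" early.
        return 1


def iter_json(obj):
    """Yield JSON text chunks for obj without building the whole string."""
    return json.JSONEncoder().iterencode(obj)


def squares(n):
    """Example generator standing in for a large stream of records."""
    for i in range(n):
        yield i * i
```

Each chunk yielded by `iter_json({"records": StreamedList(squares(3))})` can be written to a file as it arrives, so the full list never has to exist in memory.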

kindly · 2021-02-10T11:08:01Z

@jpmckinney that approach is interesting and fairly straightforward. For this case I needed one iterator to write to 2 files (result and cell_source_map) and did not want to iterate over the data again, so having a generator for a list value would not have worked. It also deals with object streaming.
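
As a rough illustration of the "one iterator, two files" constraint, the sketch below consumes a stream of pairs exactly once and writes each half to its own output. It is a simplified stand-in using only the standard library, not the jsonstreams-based code in this PR; `write_two_arrays` is a hypothetical name, and both outputs are assumed to be JSON arrays for simplicity.

```python
import io
import json


def write_two_arrays(pairs, result_file, source_map_file):
    """Consume pairs once, streaming each half into its own JSON array."""
    result_file.write("[")
    source_map_file.write("[")
    for index, (result, source_map) in enumerate(pairs):
        separator = "," if index else ""
        result_file.write(separator + json.dumps(result))
        source_map_file.write(separator + json.dumps(source_map))
    result_file.write("]")
    source_map_file.write("]")
```

A generator passed as a list value to a single encoder can only feed one output, which is why a single pass that writes to both files needs this kind of explicit loop.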

I think jsonstreams could be faster though, so I was thinking of replacing its encoder with ujson, which would be a bit of a hack but seems possible. However, the profile data shows that the time is fairly evenly split between persisting, doing the actual work and writing. So optimizing just one of these will not make much difference.

Interestingly, just using pypy instead of cpython does raise the memory usage (for my test case to ~250MB from ~150MB) but more than doubles the speed, to ~50sec from ~110sec. I think the increase in memory is just pypy overhead though, and should not grow in proportion to the size of the file. It also makes the "doing the work" section a lower proportion of the runtime, ~1/5 down from ~1/3. Without persistence but with streaming, pypy takes 35sec/900MB and cpython 90sec/700MB.

So overall I think we should encourage pypy if speed is needed, as it most likely makes a larger difference than any set of optimizations in the code itself.

jpmckinney · 2021-02-12T15:32:52Z

What's the difference between persisting and writing? I/O will be the ultimate bottleneck, but optimizing the rest can make a difference, like using PyPy. In my experience (and it seems ujson's metrics), orjson is the fastest encoder. OCDS Kit uses orjson for encoding if it's available, and otherwise uses the standard library: https://github.com/open-contracting/ocdskit/blob/master/ocdskit/util.py#L74-L92

kindly · 2021-02-12T16:04:53Z

Persisting is getting the original data from a spreadsheet onto disk initially so that it can be sorted by top level id (release.id or ocid in the OCDS case). We cannot stream this as the same top level ids exist across sheets.
Writing is then actually writing the JSON/XML file at the end.

Yes, orjson is faster but will not be compatible with jsonstreams as it uses bytes, not str. So I was not sure it would be faster than ujson if you then convert to str before putting it into jsonstreams. orjson also does not support pypy and seems harder to compile for Windows users.
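
To make the bytes versus str point concrete, here is a minimal sketch of an optional-orjson encoder with a standard library fallback. It is not the OCDS Kit code linked above nor anything in this PR; `dumps_text` is a hypothetical helper, and the extra decode step is the conversion cost discussed here.

```python
import json

try:
    import orjson  # optional dependency; orjson.dumps() returns bytes
except ImportError:
    orjson = None


def dumps_text(obj):
    """Encode obj as a JSON str, using orjson when it is installed."""
    if orjson is not None:
        # The decode is the bytes-to-str conversion a str-based writer needs.
        return orjson.dumps(obj).decode("utf-8")
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
```

Whether orjson plus the decode step still beats ujson would need measuring on real data.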

kindly · 2021-02-12T16:13:57Z

Actually I tried changing the persistence backend to leveldb from zodb and have a private branch for that. This was faster for both flattening and unflattening, and it used orjson to encode and decode from leveldb. leveldb also accepts only bytes for its keys and values.

The reason for not using this though was that leveldb is also hard to compile for Windows, it does not support an in-memory mode, and it would probably be slower if you needed to convert everything to bytes when using the normal JSON encoder.

Bjwebb

toxml in xml_output.py can now be removed, as it's been replaced with your new functions. (I couldn't comment on it directly as it's not in the diff).

As with the flatten PR, some high level comments could be good.

Bjwebb · 2021-03-05T15:57:53Z

flattentool/input.py

+        index = 0
+
+        for sheet, rows in self.get_sub_sheets_lines():
+            for row_numbar, row in enumerate(rows):


row_numbar should be row_number

Bjwebb · 2021-03-05T16:01:24Z

flattentool/__init__.py

@@ -179,7 +187,103 @@ def decimal_default(o):
    raise TypeError(repr(o) + " is not JSON serializable")


+# This is to just to make ensure_ascii and default are correct for streaming library


I think this doesn't quite grammar correctly.

Possible change:

This is to just to make sure ensure_ascii and default are correct for streaming library

Bjwebb · 2021-03-05T16:06:38Z

flattentool/input.py

@@ -502,10 +587,12 @@ def fancy_unflatten(self, with_cell_source_map, with_heading_source_map):
        return result, ordered_cell_source_map, heading_source_map


-def extract_list_to_error_path(path, input):
+def extract_list_to_error_path(path, input, index=None):


I think this function is a bit confusing. The actual implementation is fine once I got my head around it, so maybe it just requires a bit of explanation.

Bjwebb · 2021-03-05T16:07:12Z

examples/bods/unflatten/expected/out.json

@@ -1,24 +1,24 @@
 [
    {


I don't like the indentation mismatch here between this opening { and the closing one. All the other examples look okay, so I think this has to do with having a list at the root.

Looks like this is a jsonstreams bug, I can reproduce with:

import jsonstreams

with jsonstreams.Stream(jsonstreams.Type.array, filename='test.json', indent=4) as s:
    s.write({"a": 1, "b": 2})

I will report this upstream. Happy to see this merged as is.

Reported dcbaker/jsonstreams#41

kindly added 2 commits January 28, 2021 13:35

Flattening: Reduce memory Footprint.

b12fe74

* Use ijson * Use pyopenxl write_only mode * Store sheet lines in an embedded btree ZODB index #316

kindly requested a review from Bjwebb February 9, 2021 20:52

Bjwebb requested changes Mar 5, 2021

kindly force-pushed the 315-lower-memory-usage branch 2 times, most recently from 123d981 to 31b9399 Compare March 9, 2021 09:40

jpmckinney mentioned this pull request Jan 11, 2023

fix: Import backports-datetime-fromisoformat only if needed, to fix PyPy 3.7 support #415

Closed
