Add parse_json() function to omm #129
Open

CDWimmer wants to merge 1 commit into brandon-rhodes:master from CDWimmer:parse_json
Happily, I think we can have users call json.load() directly. It looks like here you take the dictionary returned by json.load(), and then build a second identical dictionary, and then throw the first one away, which I think users might find a little slower than just using the original dictionary?

See if you can update this to make no change to omm.py, but instead make the test test_omm_json_matches_old_tle() just call json.load() itself. If it works and passes, then after this lands, I'll update the docs to show the same maneuver.
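For illustration, the maneuver being suggested might look roughly like this (a sketch only; the file name and the printed fields are assumptions, not the test's actual contents):

```python
import json

# Sketch: call json.load() directly on a CelesTrak OMM file instead of
# going through a parse_json() wrapper in omm.py.
with open('omm.json') as f:
    fields_list = json.load(f)  # the OMM catalog is a JSON array of dicts

for fields in fields_list:
    print(fields['OBJECT_NAME'], fields['OBJECT_ID'])
```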
Alternately, since I apparently chose to alias the function csv.DictReader as parse_csv in this file, we could do something like parse_json = json.load. But I suspect that we should simply avoid hiding the lower-level routines, in case folks need to use their options for things like encodings.
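For comparison, that aliasing approach would sit next to the existing csv alias mentioned above (a sketch; the surrounding module contents are assumed, not quoted from omm.py):

```python
import csv
import json

# Expose the standard-library parsers directly instead of wrapping them.
parse_csv = csv.DictReader
parse_json = json.load  # the hypothetical alias discussed above
```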
This response accidentally became 3 times longer than I would have liked and comes with the largest pinch of salt ever seen by god or man, I can only apologise:

Good spot, I think this thought was trying to wrestle its way into my head earlier today!

However, I do believe wrapping it in a def provides a benefit: your parse_xml() produces a generator, rather than the list that would be returned by a raw json.load, so it will have consistent output regardless of input. A correction would be to use the dictionaries that json.load already returns, to achieve the effect without throwing anything away.
A few more thoughts I've had materialise:

- json.load takes a bunch of extra arguments that I don't think are very helpful here, and it would be good not to pollute users' IDEs with things that don't make sense to be present in this context.
- json.load/loads() has the annoying (for us) property that it will convert whatever JSON it is given into its Python equivalent without complaint, meaning that were a user to not wrap their satellite data object in an array, they would receive no error about their totally valid JSON, but would get an unexpected return type from json.load(): a plain dict. One could build functionality into this to ensure a consistent return type, or throw a more helpful error, perhaps telling the user to reformat their JSON.
- A fairly important note, I think: I'm not sure either of us has used yield properly? I think we need to move yield inside the for block to actually take advantage of it, then construct the return value on the fly instead of building a list, e.g. the sketch below.
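A minimal version of what I mean might look like this (a sketch only; parse_json is the name this PR proposes, but the body here is illustrative rather than the PR's actual code):

```python
import json

def parse_json(fileobj):
    """Yield one dict of OMM fields per satellite from a JSON array."""
    for fields in json.load(fileobj):  # json.load still reads the whole file,
        yield fields                   # but callers receive one dict at a time
```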
This way, we keep generator behaviour, don't throw away any data, and only generate a single dict at a time, while still allowing "for ... in parse(json)" usage. It's slightly slower for a handful of datasets, but with almost 200,000 lines of JSON relating to tens of thousands of satellites, and that only increasing, it'll be vastly more memory efficient! Who knows what devices will be running this stuff, after all.
I am sure something similar could be done for parse_xml, but I am actually falling asleep :(
Getting slightly off-topic:
It could be made to accept both a filepath/file object (i.e. StringIO) and a raw JSON string with a simple use of isinstance() and swapping to json.loads() (load string), or by internally re-wrapping with StringIO; I'm not sure of the best approach (rough sketch below). The same could be applied to the xml and csv parsers, because lord knows users will find incredible new ways to get their data into memory (I got here due to wanting to use a database!). I'm tempted to suggest that normalising this to also produce a generator would be good :P
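That dispatch could be as simple as the following (a sketch, assuming we even want to accept raw strings, which is exactly the open question):

```python
import json

def parse_json(source):
    """Accept either an open file-like object or a raw JSON string."""
    if isinstance(source, str):
        data = json.loads(source)  # loads = load a string
    else:
        data = json.load(source)   # file object, StringIO, etc.
    for fields in data:
        yield fields
```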
Coming back to this after some sleep:
I'm a dullard. On the one hand, I think it just makes more sense to return a list rather than use yield in all parse functions; the entire active-satellite JSON array only comes to about 3.7MB. Even with added Python overheads, if whatever is running the application can't handle loading that, then it probably can't handle the rest of the application either!

I'm back on board Team "Just Wrap json.load", except swapping yields for returns, given we're not doing any extra processing - unlike with xml, where obviously you are doing more processing within a loop first (see the sketch below).

I suppose it just bothers me that you get inconsistent results depending on the parser used, even if practically they are interchangeable in most use cases.
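That wrapper would then be about as thin as it gets (a sketch under the same assumptions as above):

```python
import json

def parse_json(fileobj):
    """Return the list of OMM field dictionaries from a CelesTrak JSON file."""
    return json.load(fileobj)  # the catalog is already a JSON array of dicts
```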
For comparison's sake I'm running some timing tests on each function, will edit this again when done
I get just under 4MB of "compact" JSON (jq -c) from the combined active and analyst catalogs. Doing some quick'n'dirty tests, it seems that on my particular machine (linux x86_64), with my particular build of Python (3.10.2), loading orbital elements takes about 1.15MB per thousand records. Let's maybe bump that up to 1.25MB per thousand records to accommodate all objects having a non-empty OBJECT_ID and a descriptive OBJECT_NAME. I don't have any 32-bit machines handy to test with.

This would be easier if the catalog were "ndjson" and you could have a generator that yields a line at a time which then gets parsed. Alas, I know of no existing incremental/streaming JSON parser that would return one dict at a time from a list of dicts.
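For what it's worth, if the catalog ever were published as newline-delimited JSON, that streaming parser would be trivial (a sketch of that hypothetical format, not of anything CelesTrak actually offers):

```python
import json

def parse_ndjson(fileobj):
    """Yield one dict per line of newline-delimited JSON (hypothetical format)."""
    for line in fileobj:        # only one line is held in memory at a time
        line = line.strip()
        if line:                # tolerate blank lines
            yield json.loads(line)
```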
Alas, it will be this weekend before I can again look at these details, but I'm happy to see y'all engaged in an entertaining discussion of different approaches. I'm also happy to see CSV winning, as I think it's by far the most sensible (as well as most widely supported) format, since it makes no sense to me here to repeat every field name for every single satellite.

So the only aside I'll make at the moment is: have you tried the Pandas read_csv() routine? If we are going to have a documentation section for folks who need to load up very large numbers of satellites, we will probably want to outline, in case they can install and use Pandas, how to load satellites, sort them, and select them from a very large CSV using a Pandas dataframe.
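The kind of maneuver being suggested might look roughly like this (a sketch; the file name, sort key, and selection are illustrative, and OBJECT_NAME/EPOCH are simply OMM column names assumed to be present in the CSV):

```python
import pandas as pd

# Load a large CelesTrak OMM catalog in CSV form into a DataFrame.
df = pd.read_csv('active.csv')

# Sort the satellites, then select just the ones of interest.
df = df.sort_values('OBJECT_NAME')
subset = df[df['OBJECT_NAME'].str.startswith('STARLINK')]

# Hand each row back as a plain per-satellite dict of fields.
for fields in subset.to_dict('records'):
    print(fields['OBJECT_NAME'], fields['EPOCH'])
```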
I think I agree about csv; ultimately it is very predictable tabular data - of course it should be in a table! So it makes perfect sense to use pandas DataFrames too.

Though, I don't know how consistent pandas has been across major Python revisions - this library supports 2.6 but the latest version of pandas requires >=3.9, so it is worth checking whether there are version-specific caveats for examples/documentation. (Or release a final 2.X/3.5 version and then drop support; I am begging for type annotations 😢 3.5 is 9 years old!)

Generally I think the docs should be an outline of "you can use this set of built-in parsers provided by the library that are guaranteed to work with CelesTrak data (examples) [...] or if you want more advanced control of the data you can construct your own parser that has output in XYZ format... (pandas example that uses some DataFrame feature)"
"Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should"
I doubt that json loading will ever be faster than csv, but it was fun to see if I could do it without reading the whole file in.
Do you mean its performance? I have found that, if I'm careful, I can get it to switch into pure-C mode for CSV parsing in all versions that I try. Let me know if you hit a snag where it tries using Python parsing, and I'll try finding a workaround for you. Note that people using old Python versions usually just download and install the Pandas version that was current at that time, and can use Pandas just fine as long as they don't rely on recent tweaks and features.
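For what it's worth, one way to be certain which parser ran, rather than guessing from timings, is to pin the engine explicitly (a short sketch; when engine="c" is requested, pandas raises an error instead of silently falling back if an option forces the Python parser):

```python
import pandas as pd

# Request the C parser explicitly.  Options that only the Python engine
# supports (e.g. skipfooter or a regex separator) will then raise a
# ValueError instead of quietly switching parsers behind your back.
df = pd.read_csv('active.csv', engine='c')
```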
Just to make things simple, I've dropped the "2.6" from setup.py and from now on plan to only advertise 2.7 and later. I think it's fair at this point to tell Python 2.6 users to consider themselves permanent users of Skyfield 1.48.
Sorry, I was unclear - I just meant whether there have been changes to the "api": how things are used, feature changes, etc.

Of course, yeah :) Just worried that if there were changes to how pandas does things then documentation examples might want to reflect that, but an exercise for the reader never hurt ;)
That is great news about the C code; I will see if I can put together a quick something to do that. Would you just do df = load_csv() and then for row in df.iterrows() as a fair(ish) comparison? Also, how do you tell when it's doing pure C? Just the time it takes?

Sounds good, though I did read a survey suggesting only about 1% or so of Python users are using 2.x now - though I suspect some physics/astronomy people aren't going to follow suit with that stat. Do you know if it's known what the userbase of this is like?
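A harness of roughly this shape would produce numbers formatted like those below (a sketch only; load_rows stands in for whichever loading strategy is being timed, and the tracemalloc/perf_counter bookkeeping is an assumption about how these figures were gathered):

```python
import time
import tracemalloc

def benchmark(name, load_rows, path, repeats=5):
    """Report average wall-clock time and memory use for one loading strategy."""
    print(f"Running {name} test")
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        for row in load_rows(path):   # e.g. a csv.DictReader loop or df.iterrows()
            pass
    elapsed = (time.perf_counter() - start) / repeats
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Current memory usage is {current / 1000:.3f}KB; "
          f"Peak was {peak / 1000:.3f}KB; Diff = {(peak - current) / 1000:.3f}KB")
    print(f"avg time: {elapsed:.4f} s")
```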
edit:
Results
```
Running CSV builtin test
Current memory usage is 5.048KB; Peak was 49.077KB; Diff = 44.029KB
CSV builtin avg time: 0.1141 s

Running CSV pandas test with default(?) backend
Current memory usage is 93.833KB; Peak was 8145.391KB; Diff = 8051.558KB
CSV pandas avg time: 0.6323 s

Running CSV pandas test with numpy_nullable backend
Current memory usage is 195.294KB; Peak was 7555.298KB; Diff = 7360.004KB
CSV pandas avg time: 0.8052 s

Running CSV pandas test with pyarrow backend - no idea why they're planning to make this default
Current memory usage is 381.754KB; Peak was 7420.537KB; Diff = 7038.783KB
CSV pandas avg time: 0.9581 s
```
I must be doing something wrong...