Maintain raw precision instead of down casting to float32 #395
Conversation
@dmey I hope you’re ok with me wanting to upstream your fix, as it’s greatly benefited me over the past year. Btw, do you need to sign the CLA as well?
Hi @juntyr, sure no problem--give it a go and let me know if I need to.
I signed it but the action still shows it as unsigned, so perhaps you need to sign it too?
It should be fine now.
Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values). Potential API:
Do you think you could make this modification to the PR, or would you rather leave it as a feature request issue?
Good point! What about the new version, where the dtype is inferred from the GRIB messages, so that cfgrib just exposes whatever dtype the GRIB file uses?
Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy dtypes. When we get the array of values from the GRIB file via the ecCodes library, we call an ecCodes function to do so, and this function decodes the file and always returns an array of float64, regardless of what the encoding in the original GRIB message was; we then convert to float32. There is also now a function to return the values as a float32 array, which cfgrib really should be calling in the first place! So since there usually is no numpy way of expressing the encoding in the GRIB file, the only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); we can then always convert to another numpy representation if needed. (The current behaviour is inefficient because we ask ecCodes for a float64 array and then cast it to a float32 array.)
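The downcast described above is easy to demonstrate with plain numpy (the values here are made up purely for illustration, not taken from a real GRIB file):

```python
import numpy as np

# float32 has a 24-bit significand, so a float64 value that needs more
# precision than that is silently rounded when the float64 array returned
# by ecCodes is cast down to float32.
values64 = np.array([1.0 + 2.0**-30], dtype=np.float64)
values32 = values64.astype(np.float32)

print(values32[0] == values64[0])  # False: the extra precision is gone
```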
Would there be a way to ask ecCodes to return the values in the closest matching numpy dtype (for now just float32 and float64), or to have this functionality inside cfgrib, i.e. query the number of bits per value and choose the closest dtype to request from ecCodes based on that?
Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.
Thanks for your explanation! In that case, I agree that your proposed API would work best, and I can implement it. In my use case, evaluating different compression strategies, I do want to know what the native precision of the GRIB file is. Is there a better way than loading the data in float64 precision, then casting to float32 and checking if any precision was lost?
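The round-trip check described in this comment could be sketched like this (the helper name is mine, not a cfgrib API):

```python
import numpy as np

# Cast the float64 values down to float32 and back, then compare with the
# originals. If the round trip is exact, float32 already captures the full
# precision of the decoded values.
def float32_is_lossless(values64: np.ndarray) -> bool:
    values32 = values64.astype(np.float32)
    return bool(np.array_equal(values32.astype(np.float64), values64))

print(float32_is_lossless(np.array([0.5, 0.25, 3.0])))  # True
print(float32_is_lossless(np.array([1.0 + 2.0**-30])))  # False
```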
Well, you must have ecCodes installed on your system if you have cfgrib there, so this on the command-line can help, if you're feeling strong enough to handle the gory details:
Look at Section 5 of the output and you will see something like this:
This is a file downloaded from where you pointed me to. You can see that it uses 9 bits per value, but on top of that it has CCSDS compression applied. You're probably sorry you asked now :)
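The exact command was not preserved in this thread, but ecCodes ships a `grib_dump` tool whose octet mode prints this kind of section-by-section breakdown; a hedged sketch (the filename is a placeholder):

```shell
# Assumes ecCodes is installed; "data.grib2" is a placeholder filename.
# -O requests the WMO-style octet dump, which lists each GRIB section,
# including Section 5 (data representation: bits per value, packing type).
grib_dump -O data.grib2
```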
I had to move to a different fork and have reopened this PR at #407
This is a fix by @dmey to avoid losing precision when using cfgrib, which David developed while working on the field compression project, which I’ve taken up.