Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API changes 0.9 -> 0.10 #60

Closed
Alexander-Barth opened this issue Jan 10, 2020 · 19 comments
Closed

API changes 0.9 -> 0.10 #60

Alexander-Barth opened this issue Jan 10, 2020 · 19 comments

Comments

@Alexander-Barth
Copy link
Owner

Alexander-Barth commented Jan 10, 2020

It would be nice to have a 1.0 release of NCDatasets. There might be some aspect of the API that need to be deprecated:

using DataStructures
ncdata = defVar(ds,"data",Int8, ("time",), 
     attrib = OrderedDict("scale_factor" => 10., "add_offset" => -1.))

instead of

ncdata = defVar(ds,"data",Int8, ("time",))
ncdata.attrib["scale_factor"]  = 10.
ncdata.attrib["add_offset"]  = -1.

Such attributes are scale_factor, add_offset and _FillValue, unit and calendar (for time variables).
Otherwise it is not possible to know the element-type of ncdata when it is created (as it can change). I belief that this is necessary address #40. You can also use Dict in the example above, if the order of attribute is not important.

  • Do not use Union{Float64,Missing} (or similar type) if a "_FillValue" attribute is not defined, as element-type. We would just use Float64 (or similarly Float32, Int8, ...).

  • The constants: NC_FILL_BYTE, NC_FILL_CHAR, NC_FILL_SHORT, NC_FILL_INT, NC_FILL_FLOAT, NC_FILL_DOUBLE, NC_FILL_UBYTE, NC_FILL_USHORT, NC_FILL_UINT, NC_FILL_INT64, NC_FILL_UINT64, NC_FILL_STRING are have to be replace by e.g. fillvalue(Int8).

Let me know, if you think that additional breaking changes are necessary before 1.0 by replying to this issue or if you disagree with what I proposed.

In any case, I think we will need a version 0.10 where we test all these changes.

@Datseris
Copy link
Contributor

Datseris commented Jan 10, 2020

Hi there, I was about to open a new issue regarding Axis array but this fits more. Seems like a lot of the features requested by users like @Datseris , and the issues described in e.g. #48 would be solved by the following extra point for a 1.0 release:

This Julia package targets more the functionality of e.g. Python's xarray, including a functionality similar to what we would like in #47 . Thus, seems reasonable to try and make NCDatasets.jl work flawlessly with AxisArrays.jl instead of trying to re-implement similar functionality here. This would also remove pressure from you deciding whether numeric operations should be here or not, as they would be handled by AxisArrays while preserving the core structural form of the data.

Another point I'd strongly recommend to consider for 1.0 is:

  • Think whether the syntax x[:] should be kept the way it is or deprecate in favor of Array(x) or make x[:] make what it would do in a Julia array and return a Vector.

Regarding the rest of your points I give 👍 to all.

@Datseris
Copy link
Contributor

AxisArrays seems to already integrate with Unitful.jl, so from our end we would only care about converting the string present in units to a Unit type from unitful.

@Datseris
Copy link
Contributor

My proposed syntax would be: Array(v) does what it does already, while AxisArray(v) makes an AxisArray instead. Probably some shorthand syntax would be nice to have as well. Darn, why didn't we think about just passing everything to AxisArrays so far? :D

@Datseris
Copy link
Contributor

Another thing to consider, if I may: some kind of warning should be thrown when accessing an Variable and co. with integer indices, to stop problems like #48. This warning should instruct the users to convert first to Array or AxisArray, and would happen once a user passes a Variable into functions like mean, which at the moment silently freeze if you have a large dataset.

@Alexander-Barth
Copy link
Owner Author

Alexander-Barth commented Jan 10, 2020

Think whether the syntax x[:] should be kept the way it is or deprecate in favor of Array(x) or make x[:] make what it would do in a Julia array and return a Vector.

I guess, you have seen the discussion at #33. So there are already at least two users who do not like this at all :-) and it is indeed inconsistent with all others arrays in julia (but surprisingly, numpy and python-netCDF4 also copy the data without flatting the array when indexing with [:]). OK, let's deprecate it so that we can change the behaviour.

Array does already work (for CFVariable arrays), but what about NetCDF scalars? I like read(ncvar) as implemented in HDF5.jl. This read function would also need to call the transform function and potentially return something which is not a Base.Array (such as Dagger.Distribute) .

some kind of warning should be thrown when accessing an Variable and co. with integer indices, to stop problems like #48.

Would you make such warning unconditionally (I mean even if the user just loads a single element)? What you do think about CuArrays.allowscalar(false) ? Maybe we could have NCDatasets.warnscalar (and true being the default)

AxisArrays seems indeed like a good candidate. Are you aware of DimensionalData.jl and GeoData.jl ? Do you have any opinion which one it preferable? The later is quite tailored to geospatial data.

@Datseris
Copy link
Contributor

I have not read #33 yet, thanks for letting me know. Yes, me and Simon raise mostly the same points, and these points fairly come from parallelizing array indexing with base Julia. In the end, the best expectation is that a user is already familiar with using normal Julia arrays, and should/would thus expect similar behavior throughout the Julia ecosystem.

What you do think about CuArrays.allowscalar(false)

Sorry, I am not familiar with CuArrays, but I am familiar with using environment variables, or global variables in the module for such a thing, like you suggest with NCDatasets.warnscalar. I think its a good idea. If one wants to be really specific about it, they can make such a variable a field of NCDataset, which defaults to true and throws such a warning, but to be truthful I don't think such specificity is worth it and I would go with a global variable like NCDatasets.warnscalar. It is fair to say that the scenarios where integer accessing happens outside the user's call (like e.g. in mean) far outweigh the scenarios where a user would deliberately need a single numeric value from an entire .nc file.

For the actual case where a scalar is indeed written on a NetCDF file, your suggestion of read(file) seems the most sensible to be honest. it is fair to say that array indexing syntax [stuff, stuff,] is really meant for arrays so it shouldn't make much sense to use the same syntax to access scalars.


AxisArrays seems indeed like a good candidate. Are you aware of DimensionalData.jl and GeoData.jl ? Do you have any opinion which one it preferable? The later is quite tailored to geospatial data.

That is an extremely important point and one I immediately (like literally 5 minutes after installing AxisArrays) encountered after using AxisArrays.jl. It worried me a bit that the community is not yet settled in AxisArray-like data, as is evident by the existence of numerous packages that do the same thing and also by the very worrying existence of https://github.com/JuliaCollections/AxisArraysFuture . This all hinds to instability, which I never like :P Give me some time, at the moment I have two goals for my existence: 1 to establish an axis-array-like conversion scheme in NCDatasets.jl (as I find it so darn useful for day to day work). I will be testing a lot of them, and also opening an issue in discourse asking the community about the status quo. My 2nd goal is to establish a direct plotting scheme that goes from Variable/CFVariable into a plotted map of the field. I'll implement this in https://github.com/JuliaClimate/ClimatePlots.jl . In fact, my plan is to directly implement a plotting method that dispatches on AxisArray, but of course, I first must conclude which package should be used for AxisArray.

@Datseris
Copy link
Contributor

I am clueless about what GeoData.jl is, as I can't even tell the difference between DimensionalData.jl. Maybe the author @rafaqz can help us here on how we should proceed and to what should a CFVariable be converted...

@rafaqz
Copy link
Contributor

rafaqz commented Jan 11, 2020

The community is far from settled, there are about 5 candidates for named dimension packages being written at the moment. DimensionalData.jl is mine, because AxisArrays is not very well written if you need to extend it in any way.

DimensionalData also defines a range of grid behaviours that allow you to track cell properties and grid bounds and other things, so eg the data can be reversed and subsetted and it is still plotted the right way at the right coordinates. These parts will eventually be moved to DimensionalArrayTraits.jl so other packages can use them. You can also use unitful units or DateTime units. It can also handle rotated grids and cordinate transforms (although this is still a WIP).

GeoData extends DimensionalData.jl to add spatial types like a GeoStack - that can be either a list of files for different layers, or a netcdf etc that contains multiple layers. This means you can use a lot of data structures without changing any code except your load script. GeoSeries does this at a higher level, when you have arrays or stack files that are distributed along some dimension like time - it lazily loads them as you need them. See this example.

What most of the code in GeoData does is implement these types for multiple backends, so you can swap them with the same code. Base/statistics methods work on them as in DimensionalData.jl, and the result is always a GeoArray with updated dimensions.

GeoData.jl already wraps NCDatasets.jl for load/save and manipulation. I use it in production and its getting fairly stable. I have a whole stack of modelling packages built on this, so it is definitely going somewhere as my work and research depends on it. I should make a first release some time in the next month.

My opinion would be not to implement dimensions again here, just use GeoData.jl and help me improve it, instead of every spatial package doing something custom and different.

@Datseris
Copy link
Contributor

The community is far from settled, there are about 5 candidates for named dimension packages being written at the moment

Both sad as well as surprised to hear this, especially given how well developed and standarized these things are in other languages, and how fundamentally necessary they are for any geo/climate related work. Since I want better support, I'd better "get my hands dirty" and start contributing even more then.

My opinion would be not to implement dimensions again here, just use GeoData.jl and help me improve it, instead of every spatial package doing something custom and different.

Sure, at least I'd be very happy to avoid as much duplication as possible and re-use, improve and extend existing code. But no matter how willing I am, at the moment it is not possible... :(

As far as I personally can say, the documentation of GeoData.jl is not at a point where anyone besides the main developer can understand what is happening. At the moment the documentation is simply a sequence of documentation strings. This is enough for someone that already knows the reasoning behind the package, as well as the sequence of steps one typically does when using it, and thus does not need aid in understanding. But unfortunately for someone who doesn't already know the package, just listing documentation strings doesn't take them far. I am happy to help with improving, and making the docs more welcoming for people, but this is something that you'd have to start, if you want others to start using, and improving, your creation (at the moment I understand so little that I find myself unable to do any contribution). The best thing to start would be actual examples or code snippets that highlight basic components, as well as guiding the person through the typical use of a package. A good docs is the n1 thing that will attract more people. (DimensionalData.jl is in a similar situation by the way, but since it has much less complexity built on top of it I could at least understand its basic components).

To simplify things for my own understanding by the way, is GeoData.jl simply an extension of DimensionalDatasets.jl that (1) adds extra metadata to the array (2) allows directly loading from NCDatasets into GeoArray (3) defines plotting recipes for GeoArray? I'll try to not derail too off-topic here, talking too much about other packages. But this situation calls for a discourse post.

@rafaqz
Copy link
Contributor

rafaqz commented Jan 11, 2020

GeoData.jl isn't released. If you want to do some work on the docs, that would be great - I'll try to get that a little further along. As I've said elsewhere this is one of 15 packages in progress I don't really have time docs, I just need the functionality asap because the ecosystem doesn't provide it elsewhere. Polishing will come later.

There was an intial discourse post on DimensionalData and GeoData a few months ago. DimensionalData.jl is published, GeoData.jl isn't - there will be an updated post when it is.

As a developer, if you want to understand a package read the tests. The docs are for users.

@Datseris
Copy link
Contributor

Oh, sorry, my bad. Didn't know it wasn't registered, but this statement:

My opinion would be not to implement dimensions again here, just use GeoData.jl and help me improve it, instead of every spatial package doing something custom and different.

certainly made me interpret it as actually being registered, as only registered packages can be dependencies of NCDatasets.jl (at least at the moment).

@rafaqz
Copy link
Contributor

rafaqz commented Jan 11, 2020

It will be published soon. But you also gave up very quickly on exploring the option and instead pushed the work back on to me. You should at least read the tests.

You also need to realise how much work it requires to get it to the current point of generality - pull requests to NCDatasets, ArchGDAL, HDF5, and writing GeoFormatTypes and DimensionalData.jl.

I'm aware of the documentation needs, and lack of docs doesn't mean I shouldn't suggest you use GeoData.jl - you asked me to comment here and we are talking about future functionality and working together towards that.

@Datseris
Copy link
Contributor

I've opened this so its easier to communicate and work towards a common goal: https://discourse.julialang.org/t/the-fate-of-dimensionalarrays-axisarrays-in-julia-and-which-to-actually-use/33253

@rafaqz
Copy link
Contributor

rafaqz commented Jan 11, 2020

What you are missing is that these things have all been discussed at length in multiple github issues, it's not like we are just realising this is a problem.

@Datseris
Copy link
Contributor

@Alexander-Barth another point to consider for 1.0: Dropping the Compat dependency and whatever old thing it is supposed to support.

Alexander-Barth added a commit that referenced this issue Jan 12, 2020
@Alexander-Barth
Copy link
Owner Author

Alexander-Barth commented Jan 12, 2020

@Datseris, of course thanks for reminding me to drop Compat!

Alexander-Barth added a commit that referenced this issue Jan 12, 2020
@Datseris
Copy link
Contributor

@Alexander-Barth I know this will be a bit "too breaking" or "too heavy", but still I believe it is worth considering for a 1.0 change: "What should the syntax v[...] return?" Depending on the status of what dimensional data fits to use as a transform for NCDatasets.jl (which I'll test extensively this week), it may be quite lightweight to transform directly to DimensionalArray/GeoArray instead of normal Julia array. Depends on your vision of course, but if one does not pay too much performance, would you find it reasonable that v[,,..] would return an e.g. DimensionalArray instead?

This comment is also inspired by the behavior of typically using .nc data in Python, when one gets an xarray once loading into memory, instead of a numpy array.

@Datseris
Copy link
Contributor

@Alexander-Barth defVar and defDim should end in !, as they are in-place operations on the first argument.

@Alexander-Barth
Copy link
Owner Author

I see defVar/defDim similar to write(io,data).

@Alexander-Barth Alexander-Barth changed the title API changes API changes 0.9 -> 0.10 Nov 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants