ec-46 compatibility #167

Open
pimmeerdink opened this issue May 30, 2023 · 7 comments

@pimmeerdink
Collaborator

Hey guys! First of all, I'm going to be working with, and on, the s2spy software for the coming period. I think we may have met once at some presentations about the work going on at the IVM, where I was presenting work for my Master's thesis in AI. Either way, I have an issue I could use your help with.

In a nutshell: we need to be able to use the preprocess functionality not just with historical data but also with modelled data. In the case of EC-46, this means that we don't only have time, latitude and longitude dimensions, but also a /step/ dimension. This is the number of steps into the future that a datapoint represents: on every day, for every gridpoint, a simulation is run 30 days into the future, so every day has 30 prediction fields associated with it. When it comes to deseasonalizing, this means that we want to group not only by the day of the year, but also by the step into the future: we calculate a normalization coefficient for every (day of year, step) combination. Sounds easy, it's just a groupby and merge over an extra axis. It gets hairy when you realize, as I did, that xarray does not support groupby operations on multiple columns. I have done some dirty coding on the ec-46 compatibility branch in s2spy/preprocess to get it working anyway, but it's crazy inefficient and, let's face it, ugly. Wondering if you guys can help: I think the code could use a little restructuring/rethinking given the new use case. I have attached a netcdf file of the format that should be pre-processable by the functions (zipped because I had to...)

Hope you guys can help! Thanks in advance.

input_data.netcdf.zip

@BSchilperoort
Contributor

BSchilperoort commented May 31, 2023

Hi Pim, we did meet at that presentation, but Yang was not there. However, if you'll be working on s2spy, I am sure that we will meet (again) some time. It would be good to align our thoughts and ideas in person.

Looking at the file and your description, having the dimension named "time" can be a bit misleading, right? As it is the time that the forecast was made, not the actual time the forecast represents. I think rearranging that would be a good first step.
Possibly with time representing the actual date of the forecast temperature, and instead of step a lead_time (or similar name) representing the number of days ahead the forecast was made.
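A minimal sketch of what that rearrangement could look like in plain xarray. It assumes that time holds the forecast initialisation dates (datetime64) and that step holds the lead times as timedelta64 values; the synthetic data, the variable name t2m and the coordinate name valid_time are illustrative only and are not taken from the attached file or from s2spy.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a forecast dataset (assumption: not the real EC-46 layout)
time = pd.date_range("2020-01-01", periods=3, freq="D")  # date the forecast was made
step = pd.to_timedelta([1, 2], unit="D")                 # how far ahead it looks

ds = xr.Dataset(
    {"t2m": (("time", "step"), np.zeros((3, 2)))},
    coords={"time": time, "step": step},
)

# Broadcasting time (1D) against step (1D) gives a 2D coordinate holding
# the date each forecast value actually refers to.
ds = ds.assign_coords(valid_time=ds["time"] + ds["step"])
```

Here valid_time is a two-dimensional coordinate over (time, step), so the original layout is kept and the date a forecast refers to is available next to the date it was made.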

This issue that @semvijverberg made in Lilio had data structured more like that: AI4S2S/lilio#54

> xarray does not support groupby operations on multiple columns.

If you stack your dimensions, you should be able to use groupby on the stacked dimension. This stacked dimension will have a MultiIndex as coordinates.
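As a small illustration of the stacking step (a sketch on synthetic data, with the stacked dimension name sample chosen here purely for illustration), stacking two dimensions yields one dimension whose index is a pandas MultiIndex, and unstacking restores the original layout:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data with a time and a step dimension (assumed layout)
time = pd.date_range("2020-01-01", periods=3, freq="D")
step = pd.to_timedelta([1, 2], unit="D")
da = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("time", "step"),
    coords={"time": time, "step": step},
)

# Combine both dimensions into one; its index is a pandas MultiIndex
stacked = da.stack(sample=("time", "step"))
index = stacked.indexes["sample"]

# Unstacking splits the combined dimension back into time and step
restored = stacked.unstack("sample")
```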

In the end, it would be nice to be able to support EC-46 and similar forecast/ensemble data in a flexible way, without writing too much custom code in the processing functions.
One way to do this is to have a converter function that takes a certain dataset (e.g. EC-46) and converts it to a format compatible with s2spy. Then we can rely on a certain data format/structure in the rest of the code.

@geek-yang
Member

@pimmeerdink Thanks for asking. I agree with @BSchilperoort: you can simply deseasonalize your data with s2spy using the stack and unstack trick, which is quick and straightforward.

Indeed we are considering supporting ensemble forecasts in a nicer way. In the meantime @semvijverberg and @jannesvaningen are experimenting with lilio and s2spy, using EC-46 data. Once we figure out a nice way and implement a new feature, we will let you know.

@pimmeerdink
Collaborator Author

pimmeerdink commented May 31, 2023

Hey guys, thanks for the responses! I'm actually the one experimenting with the ec-46 data at the moment instead of jannes and sem, and of course I agree we would like to write as little custom code as possible, and agreeing upon a data format for data like this would be the way to go. For now, I'd like to get it working in a somewhat practical way, which would help with the future development. However, we know that deseasonalizing for data of this format (so where a double groupby is necessary) will need to be supported.

I agree that stack would seem like the logical choice. However, there's a small problem with simply stacking the dimensions: we are unable to access the dayofyear attribute of the time dimension:

data.stack(doy__step=["time.dt.dayofyear", "step"])
*** KeyError: 'time.dt.dayofyear'

While we can access it directly by writing "data.time.dt.dayofyear", this does not seem to work inside a stack call. Intuitively you would then calculate it separately, assign it as a new coordinate and stack with that. However, in that case the dayofyear (doy) that we calculate and assign is a coordinate, not a dimension, and stacking is only possible for dimensions. So basically: I can't figure out how to make the stack, and thus also the groupby operation, work. If one of you could help, that would be great!

@BSchilperoort
Contributor

Hi Pim, try this:

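import xarray as xr
ds = xr.open_dataset("/home/bart/Downloads/input_data.netcdf")
# Compute the day of year from the time coordinate and store it as a variable
ds["doy"] = ds["time"].dt.dayofyear
# Promote it from a data variable to a coordinate
ds = ds.set_coords("doy")
# Make "doy" the dimension in place of "time", so that it can be stacked
ds = ds.swap_dims({"time": "doy"})
# Combine "doy" and "step" into a single MultiIndex dimension
ds = ds.stack(doystep=["doy", "step"])
ds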

@pimmeerdink
Collaborator Author

Great! That worked, thanks a lot :)

@BSchilperoort
Contributor

I had a discussion with @pimmeerdink yesterday, and the conclusion was that:

  • If the trend and climatology are a function of the step as well, e.g. slope = F(step, latitude, longitude), no changes have to be made to the preprocessor, as this is already supported.
  • If the step dimension should be flattened for the trend or climatology, changes will have to be made.
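To illustrate the first case, here is a rough sketch in plain xarray (not the s2spy preprocessor itself, whose internals are not shown in this thread). It assumes synthetic data with time, step, latitude and longitude dimensions and an illustrative variable name t2m. Grouping over time alone only reduces the time dimension, so the resulting climatology stays a function of step:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two non-leap years of daily forecasts, 3 lead steps, 1 grid point
time = pd.date_range("2001-01-01", "2002-12-31", freq="D")
step = pd.to_timedelta([1, 2, 3], unit="D")
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(len(time), len(step), 1, 1)),
    dims=("time", "step", "latitude", "longitude"),
    coords={"time": time, "step": step, "latitude": [52.0], "longitude": [5.0]},
    name="t2m",
)

# Mean over all years for each day of year; step, latitude and longitude are untouched
clim = da.groupby("time.dayofyear").mean("time")

# Subtract the matching (dayofyear, step) climatology from every forecast
anom = da.groupby("time.dayofyear") - clim
```

In this sketch clim has the dimensions dayofyear, step, latitude and longitude, so each (day of year, step) combination gets its own value without any stacking.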

@semvijverberg
Member

That is nice, curious to see a code snippet!
