Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt freq in training #1407

Merged
merged 36 commits into from
Oct 18, 2023
Merged

Adapt freq in training #1407

merged 36 commits into from
Oct 18, 2023

Conversation

coxipi
Copy link
Contributor

@coxipi coxipi commented Jun 29, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • CHANGES.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

  • Frequency adaptation should sometimes be performed in the same step as training so that adapted grouped block are preserved for the training step. This PR allows this by invoking frequency adaptation directly in training functions.
  • nbstripout in pre-commit now removes the metadata.kernelspec completely, which are annoying differences that come up when editing notebooks with different software (I think). This might be a problem if we want to mix different cell codes, e.g. python and R in a single notebook. If our use case one day evolves towards this, we can revisit this aspect.

Does this PR introduce a breaking change?

Yes. The adaptation "in-training" is simply a new feature. But the adaptation "out-of-training" (or "modular" approach), akin to the old procedure, will also yield different results. This is because a multi-dimensional rank is now used in "_processing.py".

Other information:

I explain the motivation behind this PR (which is also explained below).

Using multi-dimensional rank changes the "out-of-training" adaptation (which already is "more" in lined with the expected scientific computation). But these results are "more" in-line with the scientific computation expected. I say "more" in-line because if you really want the expected results, the adaptation should be done "in-training" so that each map block remains correctly adapted during the training. Otherwise, with the "out-of-training" adaptation we used to do: expand your data in blocks; frequency adapt them; de-expand to original data size; re-expand in blocks; train them. In the re-expansion, you could still have blocks that are not well adapted.

@github-actions github-actions bot added the sdba Issues concerning the sdba submodule. label Jun 29, 2023
xclim/sdba/_processing.py Outdated Show resolved Hide resolved
xclim/sdba/_processing.py Outdated Show resolved Hide resolved
@coxipi
Copy link
Contributor Author

coxipi commented Jun 29, 2023

Just to share some motivations for this change:

Adapt_freq modifies hist so that the dry-day frequency is equal to that of ref (P0>0), or if it's already smaller (P0<=0), then do nothing.

adapt_freq: ref, hist -> ref, hist'

Once this is done, we could re-apply the function and expect nothing to happen, since the frequency is already fixed.

adapt_freq: ref, hist' -> ref, hist'

This is not what happens when there is a window dimension:

group = sdba.Grouper.from_kwargs(**{"group": "time.dayofyear", "window": 31})["group"]
adapt_freq = {"thresh":"1e-6 kg m-2"}
adapt_freq.setdefault("group", group)
hist, pth, dP01 = sdba.processing.adapt_freq(ref, hist, **adapt_freq)
hist, pth, dP02 = sdba.processing.adapt_freq(ref, hist, **adapt_freq)

image

The problem is that we still have values where dP02>0, hence adapt_freq still need to do some fixing in this second iteration.

This happens because we don't keep the window dimension explicitly. The comparison between ref and hist is done with expanded datasets (window, time), but the correction is only done on the reduced dataset with (time). As a result, we can apply some corrections on a given doy, but on the second use of the function, as we re-expand with (window, time), the can still be the same problems as before.

One possible fix is to simply do frequency adaptation with the window argument. This ensures that each doy is well adapted, so each block will be too. This result in potentially more modifications of the datasets on smaller samples.

The other fix (proposed here) is to keep the (window, time) structure explicit, so that each block is adapted. As a result, frequency adaptation is less modular, since we want to call this within a map_blocks of the training functions. There is also perhaps a conceptual flaw: Given that adaptation can work differently on each block, it is not guaranteed that each block is coherent with each other. For example, with a window=31, the doy=3, year=1990 pr value will be used in 31 blocks. The corrected value could differ from block to block. Granted, this is a mathematical trick, and the corrected value will be small values between 0 and the minimum non-zero value in pr. But I think this is still important to highlight.

xclim/sdba/_adjustment.py Outdated Show resolved Hide resolved
Comment on lines 80 to 90
if len(dim) > 1:
temp_dim = get_temp_dimname(sim.dims, "temp")
rank = (
sim.stack(**{temp_dim: dim})
.rank(temp_dim, pct=True)
.unstack(temp_dim)
.transpose(*sim.dims)
.drop_vars([dim_ for dim_ in dim if dim_ not in sim.coords])
)
else:
rank = sim.rank(dim[0], pct=True)
Copy link
Contributor Author

@coxipi coxipi Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I incorporate multi-dim ranking in the sdba.rank function?

I'm thinking :

def rank(...
dim : str | list[str] = "time",
...
):
dim = dim if isinstance(dim,list) else [dim]
# replace rnk = da.rank(dim, pct=pct) with code above (using pct=pct) 
# rest of the function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea.

@coxipi
Copy link
Contributor Author

coxipi commented Jun 30, 2023

I should also note: Given the new multi-dimensional rank structure, even if we choose the same individual window value at the end, the processing on sim_ad is different because the ranks are different on this specific window value. Apparently, even without using the internal call to adapt_freq in the training function, the more modular call to adapt_freq now does a better job at creating a new sim_ad whose adaptation persists if we re-perform the adapt_freq routine:

group = sdba.Grouper("time.dayofyear", window=31)
sim_ad, pth, dP0        = sdba.processing.adapt_freq(prref, prsim, thresh="1 mm d-1", group=group)
sim_ad, pth, dP0_repeat = sdba.processing.adapt_freq(prref, sim_ad, thresh="1 mm d-1", group=group)

image

image

I can "sort of" understand why this the case: Take the adaptation of doy=15, with a window=31. As you perform the adaptation on the block centered on doy=15, you will really change the values for doy=15, but the multi-dimensional aspect of the new rank structure make this correction aware of neighbouring doys involved in the same block. But, conversely, doy=15 should be involved in neighbouring blocks as well. I'm confused why this works so well, I would really expect that you generally can't have a modified dataset with dimension (time) that, once re-expanded in (time, window), maintains dP0<=0 in every block. But, as I said, having the rank working on (time, window) surely helps.

At any rate, it is not exactly the same. I compared the "# 2nd try with adapt_freq" cell in "sdba.ipynb" with, "in-training approach" vs. "modular approach" with time.dayofyear and window=31, and there are some small, not-so-frequent differences comparing the scen_ad
image

This is case where no adapt_freq outright fails with singular peaks. In the example time.month shown in current notebooks, I don't see such differences as above.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added the docs Improvements to documenation label Aug 2, 2023
@coxipi coxipi marked this pull request as ready for review August 2, 2023 21:08
@coxipi coxipi added the approved Approved for additional tests label Aug 10, 2023
@Zeitsperre Zeitsperre requested a review from aulemahal September 5, 2023 15:45
Copy link
Collaborator

@aulemahal aulemahal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks a lot!

And sorry for not reviewing earlier, it slipped through my radar and under the cracks.

xclim/sdba/_adjustment.py Outdated Show resolved Hide resolved
Comment on lines 80 to 90
if len(dim) > 1:
temp_dim = get_temp_dimname(sim.dims, "temp")
rank = (
sim.stack(**{temp_dim: dim})
.rank(temp_dim, pct=True)
.unstack(temp_dim)
.transpose(*sim.dims)
.drop_vars([dim_ for dim_ in dim if dim_ not in sim.coords])
)
else:
rank = sim.rank(dim[0], pct=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea.

xclim/sdba/adjustment.py Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added the CI Automation and Contiunous Integration label Oct 17, 2023
@coxipi coxipi merged commit 316566a into master Oct 18, 2023
12 checks passed
@coxipi coxipi deleted the adapt_freq_in_training branch October 18, 2023 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Approved for additional tests CI Automation and Contiunous Integration docs Improvements to documenation sdba Issues concerning the sdba submodule.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants