Improve SPI performance #1311
Conversation
I think the computation is still quite heavy, but a few things should help. There is less redundant computation: we can either reuse `params`, or at least we no longer resample/roll roughly twice like the old code did. The user can now dissect the computation more easily:
@declare_units(
    pr="[precipitation]",
    pr_cal="[precipitation]",
    params="[]",
But yes, `params` needs an entry in `declare_units` if it is a `Quantified`.
xclim/indices/_agro.py
Outdated
def wrap_fit(da):
    # if all values are null, obtain a template of the fit and broadcast
    if indexer != {} and da.isnull().all():
I'm confused, how does this work with dask? Won't `da.isnull().all()` as a conditional trigger the computation? And why `indexer != {}`? Why would this fastpath only be used with indexing?
Good point about Dask. There is then no way to achieve a fastpath with lazy computing, no?
For the other question: yes, indeed, spatial regions with only NaNs would also benefit from this speedup independently of time selection (ignoring the Dask issue). I wanted to reserve this check for cases where I was sure there was a potential for all-NaN slices, e.g. when there was time selection. I'm not sure how costly it is to check `.isnull().all()`, but it is probably very small compared to the whole algorithm.
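As a toy stand-in (not dask or xclim code) for why the conditional is a problem: a lazy scalar must implement `__bool__`, and dask implements it by computing the task graph, so an `if` on a lazy `.isnull().all()` forces evaluation at that point:

```python
class LazyScalar:
    """Minimal stand-in for a lazy dask scalar result."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self.computed = False

    def compute(self):
        self.computed = True
        return self._compute_fn()

    def __bool__(self):
        # dask does essentially this: truthiness triggers computation
        return bool(self.compute())


flag = LazyScalar(lambda: True)  # pretend this is da.isnull().all()
if flag:  # the implicit bool() call computes the whole graph here
    pass
print(flag.computed)  # True
```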
xclim/indices/_agro.py
Outdated
        return np.zeros_like(da) * np.NaN

spi = np.zeros_like(da)
# loop over groups (similar to the numpy-groupies idea, but numpy-groupies can't be used with custom functions)
According to the docs (https://github.com/ml31415/numpy-groupies/), numpy-groupies accepts a callable as `func`; would it make sense to use that?
I had tried, but got a confusing error message leading me to believe this claim was maybe false. Maybe I just messed up. I will retry; it would be worth it to clean up the function.
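For context, the generic fallback in numpy-groupies is essentially a loop over group labels; here is a minimal pure-NumPy sketch of that pattern (illustrative only — `group_apply` is not an xclim or numpy-groupies function):

```python
import numpy as np


def group_apply(group_idx, a, func):
    """Apply `func` to the values of `a` within each group label,
    mimicking numpy_groupies.aggregate(group_idx, a, func=<callable>)."""
    out = np.full(group_idx.max() + 1, np.nan)
    for g in np.unique(group_idx):
        out[g] = func(a[group_idx == g])
    return out


vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
idx = np.array([0, 0, 1, 1, 1])
group_apply(idx, vals, np.mean)  # array([1.5, 4. ])
```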
xclim/indices/_agro.py
Outdated
params: xarray.DataArray
    Fit parameters. The `params` can be computed using ``xclim.indices.standardized_index_fit_params`` in advance.
    The output can be given here as input, and it overrides other options.
offset: Quantified
Suggestion:
- offset: Quantified
+ offset: Quantified | None = None,

# pseudo-code
if offset is None:
    if params is None:
        if dist in bounded_distributions:
            offset = 1000 mm / d
        else:
            offset = 0
    else:
        offset = params.offset
else:
    if params is not None and params.offset != offset:
        warning
That is to say: put a default of None and clearly explain in the docstring what the default behaviour is. I know I just suggested something different for the snow thing, but that was from a "temporary" perspective!
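As runnable code, the suggestion could look roughly like this; everything here (the name `resolve_offset`, `BOUNDED_DISTRIBUTIONS`, the `params_offset` argument, and the placeholder default value) is hypothetical and not xclim's API:

```python
import warnings

# hypothetical set of distributions requiring strictly positive data
BOUNDED_DISTRIBUTIONS = {"gamma", "fisk"}


def resolve_offset(offset, dist, params_offset=None):
    """Precedence: explicit offset > offset stored in params > dist default."""
    if offset is None:
        if params_offset is not None:
            return params_offset
        # placeholder default for bounded distributions, as in the pseudo-code
        return "1000 mm/d" if dist in BOUNDED_DISTRIBUTIONS else 0
    if params_offset is not None and params_offset != offset:
        warnings.warn("`offset` differs from the offset stored in `params`")
    return offset


resolve_offset(None, "gamma")            # "1000 mm/d"
resolve_offset(None, "norm")             # 0
resolve_offset(None, "gamma", "2 mm/d")  # "2 mm/d"
```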
Also, could there be some option where `offset = -wb.min()`?
I will check, but I think there is actually no restriction in scipy on using gamma/fisk distributions with zero-bounded data. A parameter can also be fitted for this offset, but that was an idea for another PR. I think the problem could be: if there are more negative values outside of the calibration period, then the offset could be too weak.
The problem with `offset = -wb.min()` is that you become sensitive to what data is included. Imagine your reference data is 1980-2010 and you fit two cases:
- 1980-2020
- 1980-2050
You could have a different minimum in cases 1 and 2, and thus a different offset. Ideally, since this is a trick, we would like our computation not to be too sensitive to it. In any case, it would be nice if the results of 1. were an exact subset of 2. when you use the same calibration data and methods for both computations.
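A toy numeric illustration of that sensitivity (made-up numbers, not xclim code): extending the record can introduce a deeper minimum, which changes the data-derived offset and therefore the shifted values on the shared period:

```python
import numpy as np

wb = np.array([-1.0, 0.5, 2.0, -3.0, 1.0])  # pretend water budget series

case1 = wb[:3]   # shorter fit period (e.g. 1980-2020)
case2 = wb       # longer fit period (e.g. 1980-2050), deeper minimum at -3.0

offset1 = -case1.min()  # 1.0
offset2 = -case2.min()  # 3.0

# The same early years end up shifted differently in the two computations,
# so case 1 is no longer an exact subset of case 2:
shifted1 = case1 + offset1         # [0.0, 1.5, 3.0]
shifted2 = (case2 + offset2)[:3]   # [2.0, 3.5, 5.0]
```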
Ideally, I would eventually like to get rid of the default 1 mm/day offset. It was just copied from monocongo's climate indices. It's a good rule of thumb, but users have found cases where a bigger offset was needed. Maybe we can keep it but warn users about getting NaNs? Maybe just in the docstring, not sure. I could check the R implementation to see if I can dig up ideas.
Co-authored-by: Pascal Bourgault <[email protected]>
Showing problems with
Sabotaging my previous example to mimic
…_spi_performance
Well done!
Two small comments, but it looks good!
Co-authored-by: Pascal Bourgault <[email protected]>
Ignore the cancelled builds, merging when docs clear.
Pull Request Checklist:
- Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added

What kind of change does this PR introduce?

- `dist_method` now avoids `vectorize=True` in its `xr.apply_ufunc`. This is the main improvement in SPI/SPEI.

Does this PR introduce a breaking change?

Yes. `pr_cal` or `wb_cal` will not be input options in the future. I could revert this breaking change if we prefer: it was a first attempt to make the computation faster, but the improvements are now independent of this change. We could also keep the modular structure for `params`, but revert to `pr_cal` instead of `cal_range`. It's a bit less efficient when `pr_cal` is simply a subset of `pr`, because you end up doing resampling/rolling twice on the calibration range for nothing. When first computing `params`, then obtaining `spi` in two steps, it makes no difference.

Other information:
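To illustrate the `vectorize=True` point with a toy kernel (not xclim's `dist_method`): `xr.apply_ufunc(..., vectorize=True)` wraps the function with `np.vectorize`, which loops in Python over every 1-d slice, whereas a kernel written to operate along the last axis of the whole array needs no such loop:

```python
import numpy as np

data = np.arange(12, dtype=float).reshape(3, 4)  # 3 series of length 4


def cdf_1d(a):
    """Per-slice kernel, as it would be called under vectorize=True."""
    return a.cumsum() / a.sum()


# np.vectorize with a core-dim signature: a Python loop over the 3 slices
looped = np.vectorize(cdf_1d, signature="(t)->(t)")(data)

# natively vectorized kernel: one pass over the whole array, no Python loop
native = data.cumsum(axis=-1) / data.sum(axis=-1, keepdims=True)

np.testing.assert_allclose(looped, native)
```

The speedup comes purely from replacing the per-slice Python loop with array-level NumPy operations; the numerical result is identical.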