Commit

dOTC (#1787)
### Pull Request Checklist:
- [ ] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #xyz
- [x] Tests for the changes have been added (for bug fixes / features)
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGES.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?

* Introduces the [dOTC](https://hess.copernicus.org/articles/23/773/2019/hess-23-773-2019.pdf) bias correction method into `sdba`.

### Does this PR introduce a breaking change?
No

### Other information:
Am I doing this right?
LamAdr authored Aug 7, 2024
2 parents 822d477 + 73aa04c commit 721375f
Showing 10 changed files with 1,432 additions and 14 deletions.
4 changes: 4 additions & 0 deletions .zenodo.json
@@ -140,6 +140,10 @@
"affiliation": "Santander Meteorology Group, Instituto de Física de Cantabria (CSIC-UC), Santander, Spain",
"orcid": "0000-0001-9053-2542"
},
{
"name": "Lamarche, Adrien",
"orcid": "0009-0002-9078-8358"
},
{
"name": "Wang, Hui-Min",
"affiliation": "National University of Singapore, Singapore, Singapore",
3 changes: 2 additions & 1 deletion AUTHORS.rst
@@ -44,4 +44,5 @@ Contributors
* Dante Castro <[email protected]> `@profesorpaiche <https://github.com/profesorpaiche>`_
* Sascha Hofmann <[email protected]> `@saschahofmann <https://github.com/saschahofmann>`_
* Javier Diez-Sierra <[email protected]> `@JavierDiezSierra <https://github.com/JavierDiezSierra>`_
* Hui-Min Wang `@Hem-W <https://github.com/Hem-W>`
* Hui-Min Wang `@Hem-W <https://github.com/Hem-W>`_
* Adrien Lamarche `@LamAdr <https://github.com/LamAdr>`_
3 changes: 2 additions & 1 deletion CHANGELOG.rst
@@ -4,7 +4,7 @@ Changelog

v0.52.0 (unreleased)
--------------------
Contributors to this version: David Huard (:user:`huard`), Trevor James Smith (:user:`Zeitsperre`), Hui-Min Wang (:user:`Hem-W`), Éric Dupuis (:user:`coxipi`), Sarah Gammon (:user:`SarahG-579462`), Pascal Bourgault (:user:`aulemahal`), Juliette Lavoie (:user:`juliettelavoie`).
Contributors to this version: David Huard (:user:`huard`), Trevor James Smith (:user:`Zeitsperre`), Hui-Min Wang (:user:`Hem-W`), Éric Dupuis (:user:`coxipi`), Sarah Gammon (:user:`SarahG-579462`), Pascal Bourgault (:user:`aulemahal`), Juliette Lavoie (:user:`juliettelavoie`), Adrien Lamarche (:user:`LamAdr`).

Announcements
^^^^^^^^^^^^^
@@ -15,6 +15,7 @@ New features and enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* ``xclim.sdba.nbutils.quantile`` and its child functions are now faster. If the module `fastnanquantile` is installed, it is used as the backend for the computation of quantiles and yields even faster results. This dependency is now listed in the `xclim[extras]` recipe. (:issue:`1255`, :pull:`1513`).
* New multivariate bias adjustment class ``MBCn``, giving a faster and more accurate implementation of the ``MBCn`` algorithm. (:issue:`1551`, :pull:`1580`).
* New multivariate bias adjustment classes ``OTC`` and ``dOTC``. (:pull:`1787`).
* `xclim` is now compatible with `pytest` versions `>=8.0.0`. (:pull:`1632`).

Breaking changes
226 changes: 218 additions & 8 deletions docs/notebooks/sdba.ipynb
@@ -344,6 +344,7 @@
"metadata": {},
"source": [
"### Third example : Multi-method protocol - Hnilica et al. 2017\n",
"\n",
"In [their paper of 2017](https://doi.org/10.1002/joc.4890), Hnilica, Hanel and Puš present a bias-adjustment method based on the principles of Principal Components Analysis.\n",
"\n",
"The idea is simple: use principal components to define coordinates on the reference and on the simulation, and then transform the simulation data from the latter to the former. Spatial correlation can thus be conserved by taking different points as the dimensions of the transform space. The method was demonstrated in the article by bias-adjusting precipitation over different drainage basins.\n",
@@ -439,9 +440,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fourth example : Multivariate bias-adjustment (Cannon, 2018)\n",
"### Fourth example : Dynamical Optimal Transport Correction - Robin et al. 2019\n",
"Robin, Vrac, Naveau and Yiou presented the dOTC multivariate bias correction method in a [2019 paper](https://hess.copernicus.org/articles/23/773/2019/).\n",
"\n",
"This section replicates the \"MBCn\" algorithm described by [Cannon (2018)](https://doi.org/10.1007/s00382-017-3580-6). The method relies on some univariate algorithm, an adaption of the N-pdf transform of [Pitié et al. (2005)](https://ieeexplore.ieee.org/document/1544887/) and a final reordering step.\n",
"Here, we use optimal transport to find mappings between reference, simulated historical and simulated future data. Following these mappings, future simulation is corrected by applying the temporal evolution of model data to the reference.\n",
"\n",
"In the following, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together."
]
@@ -452,12 +454,14 @@
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"from xclim.core.units import convert_units_to\n",
"from xclim.testing import open_dataset\n",
"\n",
"dref = open_dataset(\n",
" \"sdba/ahccd_1950-2013.nc\", chunks={\"location\": 1}, drop_variables=[\"lat\", \"lon\"]\n",
").sel(time=slice(\"1981\", \"2010\"))\n",
"dref = open_dataset(\"sdba/ahccd_1950-2013.nc\", drop_variables=[\"lat\", \"lon\"]).sel(\n",
" time=slice(\"1981\", \"2010\")\n",
")\n",
"\n",
"# Fix the standard name of the `pr` variable.\n",
"# This allows the convert_units_to below to infer the correct CF transformation (precip rate to flux)\n",
@@ -468,12 +472,218 @@
" tasmax=convert_units_to(dref.tasmax, \"K\"),\n",
" pr=convert_units_to(dref.pr, \"kg m-2 s-1\"),\n",
")\n",
"dsim = open_dataset(\n",
" \"sdba/CanESM2_1950-2100.nc\", chunks={\"location\": 1}, drop_variables=[\"lat\", \"lon\"]\n",
")\n",
"dsim = open_dataset(\"sdba/CanESM2_1950-2100.nc\", drop_variables=[\"lat\", \"lon\"])\n",
"\n",
"dhist = dsim.sel(time=slice(\"1981\", \"2010\"))\n",
"dsim = dsim.sel(time=slice(\"2041\", \"2070\"))\n",
"dref"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are going to correct the precipitations multiplicatively to make sure they don't become negative. In this context, small precipitation values can lead to huge aberrations. This problem can be mitigated with `adapt_freq_thresh`. We also need to stack our variables into a `dataArray` before feeding them to `dOTC`.\n",
"\n",
"Since the precipitations are treated multiplicatively, we have no choice but to use \"std\" for the `cov_factor` argument (the default), which means the rescaling of model data to the observed data scale is done independently for every variable. In the situation where one only has additive variables, it is recommended to use the \"cholesky\" `cov_factor`, in which case the rescaling is done in a multivariate fashion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ref = dref.where(dref.location == \"Amos\", drop=True).squeeze()\n",
"hist = dhist.where(dhist.location == \"Amos\", drop=True).squeeze()\n",
"sim = dsim.where(dsim.location == \"Amos\", drop=True).squeeze()\n",
"\n",
"ref = sdba.processing.stack_variables(ref)\n",
"hist = sdba.processing.stack_variables(hist)\n",
"sim = sdba.processing.stack_variables(sim)\n",
"\n",
"# This function has random components\n",
"np.random.seed(0)\n",
"\n",
"# Contrary to most algorithms in sdba, dOTC has no `train` method\n",
"scen = sdba.adjustment.dOTC.adjust(\n",
" ref,\n",
" hist,\n",
" sim,\n",
" kind={\n",
" \"pr\": \"*\"\n",
" }, # Since this bias correction method is multivariate, `kind` must be specified per variable\n",
" adapt_freq_thresh={\"pr\": \"3.5e-4 kg m-2 s-1\"}, # Idem\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Some analysis figures\n",
"\n",
"# Unstack variables and select a location\n",
"ref = sdba.processing.unstack_variables(ref)\n",
"hist = sdba.processing.unstack_variables(hist)\n",
"sim = sdba.processing.unstack_variables(sim)\n",
"scen = sdba.processing.unstack_variables(scen)\n",
"\n",
"fig = plt.figure(figsize=(10, 10))\n",
"gs = plt.matplotlib.gridspec.GridSpec(2, 2, fig)\n",
"ax_pr = plt.subplot(gs[0, 0])\n",
"ax_tasmax = plt.subplot(gs[0, 1])\n",
"ax_scatter = plt.subplot(gs[1, :])\n",
"\n",
"# Precipitation\n",
"hist.pr.plot(ax=ax_pr, color=\"c\", label=\"Simulation (past)\")\n",
"ref.pr.plot(ax=ax_pr, color=\"b\", label=\"Reference\", alpha=0.5)\n",
"sim.pr.plot(ax=ax_pr, color=\"y\", label=\"Simulation (future)\")\n",
"scen.pr.plot(ax=ax_pr, color=\"r\", label=\"Corrected\", alpha=0.5)\n",
"ax_pr.set_title(\"Precipitation\")\n",
"\n",
"# Maximum temperature\n",
"hist.tasmax.plot(ax=ax_tasmax, color=\"c\")\n",
"ref.tasmax.plot(ax=ax_tasmax, color=\"b\", alpha=0.5)\n",
"sim.tasmax.plot(ax=ax_tasmax, color=\"y\")\n",
"scen.tasmax.plot(ax=ax_tasmax, color=\"r\", alpha=0.5)\n",
"ax_tasmax.set_title(\"Maximum temperature\")\n",
"\n",
"# Scatter\n",
"ref.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"b\", edgecolors=\"k\", s=20)\n",
"scen.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"r\", edgecolors=\"k\", s=20)\n",
"sim.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"y\", edgecolors=\"k\", s=20)\n",
"hist.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"c\", edgecolors=\"k\", s=20)\n",
"ax_scatter.set_title(\"Variables distribution\")\n",
"\n",
"# Example mapping\n",
"max_time = scen.pr.idxmax().data\n",
"max_idx = np.where(scen.time.data == max_time)[0][0]\n",
"\n",
"scen_x = scen.tasmax.isel(time=max_idx)\n",
"scen_y = scen.pr.isel(time=max_idx)\n",
"sim_x = sim.tasmax.isel(time=max_idx)\n",
"sim_y = sim.pr.isel(time=max_idx)\n",
"\n",
"ax_scatter.scatter(scen_x, scen_y, color=\"r\", edgecolors=\"k\", s=30, linewidth=1)\n",
"ax_scatter.scatter(sim_x, sim_y, color=\"y\", edgecolors=\"k\", s=30, linewidth=1)\n",
"\n",
"prop = dict(arrowstyle=\"-|>,head_width=0.3,head_length=0.8\", facecolor=\"black\", lw=1)\n",
"ax_scatter.annotate(\"\", xy=(scen_x, scen_y), xytext=(sim_x, sim_y), arrowprops=prop)\n",
"\n",
"ax_pr.legend()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import gaussian_kde\n",
"\n",
"fig = plt.figure(figsize=(10, 5))\n",
"gs = plt.matplotlib.gridspec.GridSpec(1, 2, fig)\n",
"\n",
"tasmax = plt.subplot(gs[0, 0])\n",
"pr = plt.subplot(gs[0, 1])\n",
"\n",
"sim_t = sim.tasmax.to_numpy()\n",
"scen_t = scen.tasmax.to_numpy()\n",
"stack = np.vstack([sim_t, scen_t])\n",
"z = gaussian_kde(stack)(stack)\n",
"idx = z.argsort()\n",
"sim_t, scen_t, z = sim_t[idx], scen_t[idx], z[idx]\n",
"tasmax.scatter(sim_t, scen_t, c=z, s=1, cmap=\"viridis\")\n",
"tasmax.set_title(\"Tasmax\")\n",
"tasmax.set_ylabel(\"scen tasmax\")\n",
"tasmax.set_xlabel(\"sim tasmax\")\n",
"\n",
"sim_p = sim.pr.to_numpy()\n",
"scen_p = scen.pr.to_numpy()\n",
"stack = np.vstack([sim_p, scen_p])\n",
"z = gaussian_kde(stack)(stack)\n",
"idx = z.argsort()\n",
"sim_p, scen_p, z = sim_p[idx], scen_p[idx], z[idx]\n",
"pr.scatter(sim_p, scen_p, c=z, s=1, cmap=\"viridis\")\n",
"pr.set_title(\"Pr\")\n",
"pr.set_ylabel(\"scen pr\")\n",
"pr.set_xlabel(\"sim pr\")\n",
"\n",
"fig.suptitle(\"Correlations between input and output per variable\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This last plot shows the correlation between input and output per variable. Here we see a relatively strong correlation for all variables, meaning they are all taken into account when finding the optimal transport mappings. This is because we're using the (by default) `transform = 'max_distance'` argument. Were the data not transformed, the distances along the precipitation dimension would be very small relative to the temperature distances. Precipitation values would then be spread around at very low cost and have virtually no effect on the result. See this in action with `transform = None`.\n",
"\n",
"The chunks we see in the tasmax data are artefacts of the `bin_width`."
]
},
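Why the transform matters can be seen in a self-contained toy sketch of discrete optimal transport. This is an illustration of the idea only, not sdba's `max_distance` transform itself: it uses a linear assignment as the OT solver between equal-weight samples, and the sample sizes and scales are made up:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 100

# Two toy bivariate samples: dimension 0 plays "tasmax" (large spread),
# dimension 1 plays "pr" (tiny spread)
x = np.column_stack([rng.normal(300, 10, n), rng.normal(1e-4, 3e-5, n)])
y = np.column_stack([rng.normal(302, 10, n), rng.normal(1.2e-4, 3e-5, n)])


def ot_plan(a, b):
    # Discrete optimal transport between two equal-weight samples of the
    # same size reduces to a linear assignment on pairwise squared distances
    cost = cdist(a, b, "sqeuclidean")
    _, cols = linear_sum_assignment(cost)
    return cols


# Raw units: the pr column contributes almost nothing to the cost,
# so the mapping is driven nearly entirely by tasmax
plan_raw = ot_plan(x, y)

# Rescale both dimensions to comparable ranges first (in the spirit of
# the transform applied by dOTC), then recompute the plan
scale = np.abs(np.concatenate([x, y])).max(axis=0)
plan_scaled = ot_plan(x / scale, y / scale)

print((plan_raw != plan_scaled).mean())  # fraction of points remapped once pr "counts"
```

Without the rescaling, moving mass along the precipitation axis is nearly free, so precipitation barely constrains the mapping; after rescaling, both variables shape the plan.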
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fifth example : Multivariate bias-adjustment with multiple steps (Cannon, 2018)\n",
"\n",
"This section replicates the \"MBCn\" algorithm described by [Cannon (2018)](https://doi.org/10.1007/s00382-017-3580-6). The method relies on some univariate algorithm, an adaption of the N-pdf transform of [Pitié et al. (2005)](https://ieeexplore.ieee.org/document/1544887/) and a final reordering step.\n",
"\n",
"As in the dOTC example, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together. This time, we chunk our data with Dask."
]
},
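The rotate / quantile-map / rotate-back loop at the core of the N-pdf transform can be sketched in a few lines of NumPy. This is a hypothetical minimal version for illustration only; `quantile_map` and `npdf_transform` are made-up helpers, not xclim functions, and the samples are synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)


def quantile_map(x, ref):
    # Univariate empirical quantile mapping: replace each value of x by the
    # reference value of the same rank (assumes equal sample sizes)
    ranks = x.argsort().argsort()
    return np.sort(ref)[ranks]


def npdf_transform(sim, ref, n_iter=20):
    # Sketch of the N-pdf transform: draw a random rotation, quantile-map
    # each rotated coordinate onto the reference, rotate back, repeat
    d = sim.shape[1]
    for _ in range(n_iter):
        # Random orthogonal matrix via QR decomposition of a Gaussian draw
        q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        sim_r, ref_r = sim @ q, ref @ q
        for j in range(d):
            sim_r[:, j] = quantile_map(sim_r[:, j], ref_r[:, j])
        sim = sim_r @ q.T
    return sim


ref = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
sim = rng.multivariate_normal([2, -1], [[1.0, -0.5], [-0.5, 2.0]], size=1000)

out = npdf_transform(sim, ref)
print(np.corrcoef(out.T)[0, 1])  # drawn toward the reference's correlation
```

Each univariate mapping only fixes one projection of the distribution, but iterating over many random rotations progressively transfers the full multivariate distribution; MBCn then reorders the univariately adjusted series to follow this multivariate structure.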
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dref[\"pr\"] = dref.pr.chunk({\"location\": 1})\n",
"dref[\"tasmax\"] = dref.tasmax.chunk({\"location\": 1})\n",
"\n",
"dhist[\"pr\"] = dhist.pr.chunk({\"location\": 1})\n",
"dhist[\"tasmax\"] = dhist.tasmax.chunk({\"location\": 1})\n",
"\n",
"dsim[\"pr\"] = dsim.pr.chunk({\"location\": 1})\n",
"dsim[\"tasmax\"] = dsim.tasmax.chunk({\"location\": 1})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Perform an initial univariate adjustment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# additive for tasmax\n",
"QDMtx = sdba.QuantileDeltaMapping.train(\n",
" dref.tasmax, dhist.tasmax, nquantiles=20, kind=\"+\", group=\"time\"\n",
")\n",
"# Adjust both hist and sim, we'll feed both to the Npdf transform.\n",
"scenh_tx = QDMtx.adjust(dhist.tasmax)\n",
"scens_tx = QDMtx.adjust(dsim.tasmax)\n",
"\n",
"# remove == 0 values in pr:\n",
"dref[\"pr\"] = sdba.processing.jitter_under_thresh(dref.pr, \"0.01 mm d-1\")\n",
"dhist[\"pr\"] = sdba.processing.jitter_under_thresh(dhist.pr, \"0.01 mm d-1\")\n",
"dsim[\"pr\"] = sdba.processing.jitter_under_thresh(dsim.pr, \"0.01 mm d-1\")\n",
"\n",
"# multiplicative for pr\n",
"QDMpr = sdba.QuantileDeltaMapping.train(\n",
" dref.pr, dhist.pr, nquantiles=20, kind=\"*\", group=\"time\"\n",
")\n",
"# Adjust both hist and sim, we'll feed both to the Npdf transform.\n",
"scenh_pr = QDMpr.adjust(dhist.pr)\n",
"scens_pr = QDMpr.adjust(dsim.pr)\n",
"\n",
"# Stack variables : Dataset -> DataArray with `multivar` dimension\n",
"dref, dhist, dsim = (sdba.stack_variables(da) for da in (dref, dhist, dsim))"
51 changes: 51 additions & 0 deletions docs/references.bib
@@ -2152,3 +2152,54 @@ @article{droogers2002
url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-0036464359&doi=10.1023%2fA%3a1015508322413&partnerID=40&md5=7322aaa4c6874878f5b1dab3c73c1718},
type = {Article}
}

@Article{robin_2019,
author = {Robin, Y. and Vrac, M. and Naveau, P. and Yiou, P.},
title = {Multivariate stochastic bias corrections with optimal transport},
journal = {Hydrology and Earth System Sciences},
volume = {23},
year = {2019},
number = {2},
pages = {773--786},
url = {https://hess.copernicus.org/articles/23/773/2019/},
doi = {10.5194/hess-23-773-2019}
}

@misc{robin_2021,
title = {{SBCK}: {Statistical} {Bias} {Correction} {Kit}},
copyright = {GPL-3},
shorttitle = {{SBCK}},
url = {https://github.com/yrobink/SBCK-python},
urldate = {2024-07-03},
author = {Robin, Yoann},
year = {2021},
}

@article{higham_1988,
title = {Computing a nearest symmetric positive semidefinite matrix},
journal = {Linear Algebra and its Applications},
volume = {103},
pages = {103-118},
year = {1988},
issn = {0024-3795},
  doi = {10.1016/0024-3795(88)90223-6},
url = {https://www.sciencedirect.com/science/article/pii/0024379588902236},
author = {Nicholas J. Higham},
abstract = {The nearest symmetric positive semidefinite matrix in the Frobenius norm to an arbitrary real matrix A is shown to be (B + H)/2, where H is the symmetric polar factor of B=(A + AT)/2. In the 2-norm a nearest symmetric positive semidefinite matrix, and its distance δ2(A) from A, are given by a computationally challenging formula due to Halmos. We show how the bisection method can be applied to this formula to compute upper and lower bounds for δ2(A) differing by no more than a given amount. A key ingredient is a stable and efficient test for positive definiteness, based on an attempted Choleski decomposition. For accurate computation of δ2(A) we formulate the problem as one of zero finding and apply a hybrid Newton-bisection algorithm. Some numerical difficulties are discussed and illustrated by example.}
}

@article{knol_1989,
title = "Least-squares approximation of an improper correlation matrix by a proper one",
abstract = "An algorithm is presented for the best least-squares fitting correlation matrix approximating a given missing value or improper correlation matrix. The proposed algorithm is based upon a solution for Mosier's oblique Procrustes rotation problem offered by ten Berge and Nevels. A necessary and sufficient condition is given for a solution to yield the unique global minimum of the least-squares function. Empirical verification of the condition indicates that the occurrence of non-optimal solutions with the proposed algorithm is very unlikely. A possible drawback of the optimal solution is that it is a singular matrix of necessity. In cases where singularity is undesirable, one may impose the additional nonsingularity constraint that the smallest eigenvalue of the solution be δ, where δ is an arbitrary small positive constant. Finally, it may be desirable to weight the squared errors of estimation differentially. A generalized solution is derived which satisfies the additional nonsingularity constraint and also allows for weighting. The generalized solution can readily be obtained from the standard “unweighted singular” solution by transforming the observed improper correlation matrix in a suitable way.",
keywords = "Missing value correlation, indefinite correlation matrix, IR-85889, tetrachoric correlation, constrained least-squares approximation",
author = "Knol, {Dirk L.} and {ten Berge}, {Jos M.F.}",
year = "1989",
doi = "10.1007/BF02294448",
language = "Undefined",
volume = "54",
pages = "53--61",
journal = "Psychometrika",
issn = "0033-3123",
publisher = "Springer",
number = "1",
}
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -114,7 +114,7 @@ docs = [
"sphinxcontrib-bibtex",
"sphinxcontrib-svg2pdfconverter[Cairosvg]"
]
extras = ["fastnanquantile >=0.0.2"]
extras = ["fastnanquantile >=0.0.2", "POT"]
all = ["xclim[dev]", "xclim[docs]", "xclim[extras]"]

[project.scripts]
@@ -166,7 +166,7 @@ values = [

[tool.codespell]
skip = 'xclim/data/*.json,docs/_build,docs/notebooks/xclim_training/*.ipynb,docs/references.bib,__pycache__,*.gz,*.nc,*.png,*.svg,*.whl'
ignore-words-list = "absolue,astroid,bloc,bui,callendar,degreee,environnement,hanel,inferrable,lond,nam,nd,ressources,socio-economic,sie,vas"
ignore-words-list = "absolue,astroid,bloc,bui,callendar,degreee,environnement,hanel,inferrable,lond,nam,nd,ot,ressources,socio-economic,sie,vas"

[tool.coverage.run]
relative_files = true
@@ -180,6 +180,7 @@
[tool.deptry.package_module_name_map]
"scikit-learn" = "sklearn"
"pyyaml" = "yaml"
"POT" = "ot"

[tool.deptry.per_rule_ignores]
DEP001 = ["SBCK"]
