Fill gaps limited 7665 #9402

Open
wants to merge 23 commits into
base: main
bbfc476
Introduce new arguments limit_direction, limit_area, limit_use coordi…
Ockenfuss Jun 10, 2024
16cdf30
Use internal broadcasting and transpose instead of ones_like
Ockenfuss Jun 10, 2024
43b7165
Typo: Default False in doc for limit_use_coordinates
Ockenfuss Jun 10, 2024
d46baa4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 10, 2024
1fb7795
Towards masked implementation
Ockenfuss Jun 11, 2024
878e6bb
Working fill_gaps implementation
Ockenfuss Jun 20, 2024
1ac5e9c
Remove keep_attrs from docstring of filling functions
Ockenfuss Aug 23, 2024
97b00a4
Fix typos, undo empty spaces, remove temporarily introduced arguments
Ockenfuss Aug 23, 2024
1626489
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 23, 2024
b3d70d6
Add line break for readability
Ockenfuss Aug 24, 2024
6b4c0f7
Enforce kwargs to be passed by name
Ockenfuss Aug 24, 2024
84fe728
Keep_Attrs: Default to True
Ockenfuss Aug 24, 2024
3bbd6da
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 24, 2024
3ec34bf
Explicitly add fill functions in GapMask object
Ockenfuss Aug 25, 2024
a64809a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 25, 2024
92c6b2a
Add type hints to most arguments, return types
Ockenfuss Aug 25, 2024
5452df1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 25, 2024
2f53449
Fix accidental double pasting of arguments
Ockenfuss Aug 25, 2024
3696d63
Fix more mypy errors
Ockenfuss Aug 25, 2024
d5d56ae
Bottleneck is required for limit functionality
Ockenfuss Aug 25, 2024
7570b62
Docs: Require numbagg or bottleneck for ffill/bfill/fill_gaps
Ockenfuss Aug 26, 2024
f19f626
Rework index conversion to have consistent typing
Ockenfuss Aug 26, 2024
b5025ff
Add new method to api.rst
Ockenfuss Oct 2, 2024
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -167,6 +167,7 @@ Missing value handling
Dataset.fillna
Dataset.ffill
Dataset.bfill
Dataset.fill_gaps
Dataset.interpolate_na
Dataset.where
Dataset.isin
@@ -357,6 +358,7 @@ Missing value handling
DataArray.fillna
DataArray.ffill
DataArray.bfill
DataArray.fill_gaps
DataArray.interpolate_na
DataArray.where
DataArray.isin
180 changes: 161 additions & 19 deletions xarray/core/dataarray.py
@@ -93,6 +93,7 @@
from xarray.backends import ZarrStore
from xarray.backends.api import T_NetcdfEngine, T_NetcdfTypes
from xarray.core.groupby import DataArrayGroupBy
from xarray.core.missing import GapMask
from xarray.core.resample import DataArrayResample
from xarray.core.rolling import DataArrayCoarsen, DataArrayRolling
from xarray.core.types import (
@@ -103,6 +104,8 @@
ErrorOptions,
ErrorOptionsWithWarn,
InterpOptions,
LimitAreaOptions,
LimitDirectionOptions,
PadModeOptions,
PadReflectOptions,
QuantileMethods,
@@ -113,6 +116,7 @@
SideOptions,
T_ChunkDimFreq,
T_ChunksFreq,
T_GapLength,
T_Xarray,
)
from xarray.core.weighted import DataArrayWeighted
@@ -3476,27 +3480,19 @@ def fillna(self, value: Any) -> Self:

def interpolate_na(
self,
- dim: Hashable | None = None,
+ dim: Hashable,
method: InterpOptions = "linear",
limit: int | None = None,
- use_coordinate: bool | str = True,
- max_gap: (
- None
- | int
- | float
- | str
- | pd.Timedelta
- | np.timedelta64
- | datetime.timedelta
- ) = None,
+ use_coordinate: bool | Hashable = True,
+ max_gap: T_GapLength | None = None,
keep_attrs: bool | None = None,
**kwargs: Any,
) -> Self:
"""Fill in NaNs by interpolating according to different methods.

Parameters
----------
- dim : Hashable or None, optional
+ dim : Hashable
Specifies the dimension along which to interpolate.
method : {"linear", "nearest", "zero", "slinear", "quadratic", "cubic", "polynomial", \
"barycentric", "krogh", "pchip", "spline", "akima"}, default: "linear"
@@ -3511,17 +3507,17 @@ def interpolate_na(
- 'barycentric', 'krogh', 'pchip', 'spline', 'akima': use their
respective :py:class:`scipy.interpolate` classes.

+ limit : int or None, default: None
+ Maximum number of consecutive NaNs to fill. Must be greater than 0
+ or None for no limit. This filling is done regardless of the size of
+ the gap in the data. To only interpolate over gaps less than a given length,
+ see ``max_gap``.
use_coordinate : bool or str, default: True
Specifies which index to use as the x values in the interpolation
formulated as `y = f(x)`. If False, values are treated as if
equally-spaced along ``dim``. If True, the IndexVariable `dim` is
used. If ``use_coordinate`` is a string, it specifies the name of a
coordinate variable to use as the index.
- limit : int or None, default: None
- Maximum number of consecutive NaNs to fill. Must be greater than 0
- or None for no limit. This filling is done regardless of the size of
- the gap in the data. To only interpolate over gaps less than a given length,
- see ``max_gap``.
max_gap : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
Use None for no limit. When interpolating along a datetime64 dimension
@@ -3567,6 +3563,7 @@ def interpolate_na(
>>> da = xr.DataArray(
... [np.nan, 2, 3, np.nan, 0], dims="x", coords={"x": [0, 1, 2, 3, 4]}
... )

>>> da
<xarray.DataArray (x: 5)> Size: 40B
array([nan, 2., 3., nan, 0.])
@@ -3601,7 +3598,7 @@ def interpolate_na(
def ffill(self, dim: Hashable, limit: int | None = None) -> Self:
"""Fill NaN values by propagating values forward

- *Requires bottleneck.*
+ *Requires numbagg or bottleneck.*

Parameters
----------
@@ -3685,7 +3682,7 @@ def ffill(self, dim: Hashable, limit: int | None = None) -> Self:
def bfill(self, dim: Hashable, limit: int | None = None) -> Self:
"""Fill NaN values by propagating values backward

- *Requires bottleneck.*
+ *Requires numbagg or bottleneck.*

Parameters
----------
@@ -3766,6 +3763,151 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:

return bfill(self, dim, limit=limit)

def fill_gaps(
Collaborator

Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?

Contributor Author

Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill

Collaborator

I think .fill could be quite nice, do others have a view?

Contributor @dcherian, Sep 3, 2024

Maybe gap_filler instead, since this method does not actually fill the gaps.

I'm also wondering if it's better to have a method that constructs the appropriate mask that can be used later:

mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)

Contributor Author @Ockenfuss, Sep 4, 2024

Good points! Just to explain:

gap_filler emphasizes the returned object type nicely! However, I chose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not performed immediately and an object is returned).
Ultimately (I am a non-native English speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.

The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error-prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:

mask=ds.gap_filler(...).mask
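
For readers following the thread, the difference between the two API shapes can be sketched without xarray. The block below is purely illustrative and is not code from this PR: `long_gap_mask`, `forward_fill`, and `GapFillerSketch` are hypothetical names, and gaps are measured by counting NaNs, which is a simplification of the coordinate-based gap length defined in the PR's docstring.

```python
import math


def long_gap_mask(values, max_gap):
    """Return True for every NaN that sits in a run of more than ``max_gap`` NaNs."""
    mask = [False] * len(values)
    i = 0
    while i < len(values):
        if not math.isnan(values[i]):
            i += 1
            continue
        start = i
        while i < len(values) and math.isnan(values[i]):
            i += 1
        if i - start > max_gap:  # run is too long: mark it as "do not fill"
            for j in range(start, i):
                mask[j] = True
    return mask


def forward_fill(values):
    """Replace each NaN with the last valid value seen so far (if any)."""
    filled, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled


class GapFillerSketch:
    """Hypothetical object-style wrapper: stores the mask and applies it after filling."""

    def __init__(self, values, max_gap):
        self.values = values
        self.mask = long_gap_mask(values, max_gap)

    def ffill(self):
        filled = forward_fill(self.values)
        return [math.nan if m else f for f, m in zip(filled, self.mask)]


values = [math.nan, 2.0, math.nan, math.nan, 5.0, math.nan, 0.0]

# Function style: the caller fills first, then has to undo the fill where masked.
mask = long_gap_mask(values, max_gap=1)
function_style = [math.nan if m else f for f, m in zip(forward_fill(values), mask)]

# Object style: the masking step is applied inside the object.
object_style = GapFillerSketch(values, max_gap=1).ffill()
```

In the function style the caller is responsible for applying the mask correctly (the `~mask` step in the suggestion above); the object style does this internally while still exposing `.mask` for callers who need it.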

self,
dim: Hashable,
*,
use_coordinate: bool | Hashable = True,
limit: T_GapLength | None = None,
limit_direction: LimitDirectionOptions = "both",
limit_area: LimitAreaOptions | None = None,
max_gap: T_GapLength | None = None,
) -> GapMask[DataArray]:
"""Fill in gaps (consecutive missing values) in the data using one of several filling methods.
Allows for fine control on how far to extend the valid data into the gaps and the maximum size of the gaps to fill.

*Requires numbagg or bottleneck.*

Parameters
----------
dim : Hashable
Specifies the dimension along which to calculate gap sizes.
use_coordinate : bool or Hashable, default: True
Specifies which index to use when calculating gap sizes.

- False: a consecutive integer index is created along ``dim`` (0, 1, 2, ...).
- True: the IndexVariable `dim` is used.
- String: specifies the name of a coordinate variable to use as the index.

limit : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum number or distance of consecutive NaNs to fill.
Use None for no limit. When distances are calculated along a datetime64 dimension
and ``use_coordinate=True``, ``limit`` can be one of the following:

- a string that is valid input for pandas.to_timedelta
- a :py:class:`numpy.timedelta64` object
- a :py:class:`pandas.Timedelta` object
- a :py:class:`datetime.timedelta` object

Otherwise, ``limit`` must be an int or a float.
If ``use_coordinate=True``, for ``limit_direction="forward"`` distance is defined
as the difference between the coordinate at a NaN value and the coordinate of the nearest valid value
to the left (to the right for ``limit_direction="backward"``).
For example, consider::

<xarray.DataArray (x: 9)>
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8

For ``limit_direction="forward"``, distances are ``[nan, nan, nan, 0, 1, 2, 0, 1, 2]``.
To only fill gaps less than a given length,
see ``max_gap``.
limit_direction : {"forward", "backward", "both"}, default: "both"
Consecutive NaNs will be filled in this direction.
limit_area : {"inside", "outside"} or None, default: None
Consecutive NaNs will be filled with this restriction.

- None: No fill restriction.
- "inside": Only fill NaNs surrounded by valid values
- "outside": Only fill NaNs outside valid values (extrapolate).
max_gap : int, float, str, pandas.Timedelta, numpy.timedelta64, datetime.timedelta, default: None
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
Use None for no limit. When calculated along a datetime64 dimension
and ``use_coordinate=True``, ``max_gap`` can be one of the following:

- a string that is valid input for pandas.to_timedelta
- a :py:class:`numpy.timedelta64` object
- a :py:class:`pandas.Timedelta` object
- a :py:class:`datetime.timedelta` object

Otherwise, ``max_gap`` must be an int or a float. If ``use_coordinate=False``, a linear integer
index is created. Gap length is defined as the difference
between coordinate values at the first data point after a gap and the last valid value
before a gap. For gaps at the beginning (end), gap length is defined as the difference
between coordinate values at the first (last) valid data point and the first (last) NaN.
For example, consider::

<xarray.DataArray (x: 9)>
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8

The gap lengths are 3-0 = 3; 6-3 = 3; and 8-6 = 2, respectively.

Returns
-------
Gap Mask: GapMask
An object where all remaining gaps are masked. Unmasked values can be filled by calling any of the provided methods.

See Also
--------
DataArray.fillna
DataArray.ffill
DataArray.bfill
DataArray.interpolate_na
pandas.DataFrame.interpolate

Notes
-----
``limit`` and ``max_gap`` have different effects on gaps: if ``limit`` is set, *some* values in a gap will be filled (up to the given distance from the boundaries), whereas ``max_gap`` prevents *any* filling for gaps larger than the given distance.

Examples
--------
>>> da = xr.DataArray(
... [np.nan, 2, np.nan, np.nan, 5, np.nan, 0],
... dims="x",
... coords={"x": [0, 1, 2, 3, 4, 5, 6]},
... )

>>> da
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., nan, nan, 5., nan, 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", limit=1, limit_direction="forward").interpolate_na(
... dim="x"
... )
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2. , 3. , nan, 5. , 2.5, 0. ])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", max_gap=2, limit_direction="forward").ffill(dim="x")
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., nan, nan, 5., 5., 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6

>>> da.fill_gaps(dim="x", limit_area="inside").fillna(9)
<xarray.DataArray (x: 7)> Size: 56B
array([nan, 2., 9., 9., 5., 9., 0.])
Coordinates:
* x (x) int64 56B 0 1 2 3 4 5 6
"""
from xarray.core.missing import mask_gaps

return mask_gaps(
self,
dim,
use_coordinate=use_coordinate,
limit=limit,
limit_direction=limit_direction,
limit_area=limit_area,
max_gap=max_gap,
)

def combine_first(self, other: Self) -> Self:
"""Combine two DataArray objects, with union of coordinates.

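
The distance and gap-length definitions in the `fill_gaps` docstring can be checked by hand. The following is a minimal, library-free sketch that reproduces the numbers quoted there for the nine-element example. The helper names `forward_distances` and `gap_lengths` are hypothetical and are not part of xarray or of this PR, and the `0 < d <= limit` comparison is an assumption about how ``limit`` is applied, not a statement of the PR's implementation.

```python
import math


def forward_distances(values, coords):
    """Coordinate distance from each point to the nearest valid value on its left."""
    distances = []
    last_valid = None  # coordinate of the most recent non-NaN value
    for value, coord in zip(values, coords):
        if not math.isnan(value):
            last_valid = coord
        distances.append(math.nan if last_valid is None else coord - last_valid)
    return distances


def gap_lengths(values, coords):
    """Length of each run of NaNs, following the ``max_gap`` wording in the docstring."""
    lengths = []
    n = len(values)
    i = 0
    while i < n:
        if not math.isnan(values[i]):
            i += 1
            continue
        start = i
        while i < n and math.isnan(values[i]):
            i += 1
        end = i - 1  # index of the last NaN in this run
        # Inner gaps are measured between the surrounding valid points; gaps at the
        # edges are measured from the first (last) NaN to the adjacent valid point.
        left = coords[start - 1] if start > 0 else coords[start]
        right = coords[end + 1] if end + 1 < n else coords[end]
        lengths.append(right - left)
    return lengths


nan = math.nan
values = [nan, nan, nan, 1.0, nan, nan, 4.0, nan, nan]
coords = list(range(9))

dist = forward_distances(values, coords)
gaps = gap_lengths(values, coords)

# Assumed reading of ``limit=1`` with forward direction: a NaN is fillable when
# its forward distance is positive and at most ``limit``.
limit = 1
within_limit = [0 < d <= limit for d in dist]
```

Under these assumptions, `limit` selects individual positions inside each gap (here only the first NaN after each valid value), whereas `max_gap` would accept or reject each gap as a whole based on the lengths in `gaps`.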