Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill gaps limited 7665 #9402

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

Ockenfuss
Copy link
Contributor

@Ockenfuss Ockenfuss commented Aug 23, 2024

Improve gap filling

This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function fill_gaps was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.

TBD if accepted:

  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • Add Example in Documentation

Deprecations

  • Using limit in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems extremely good!!

I left some refine-y comments, nothing preclusive.

What do others think? I don't use interpolate_na myself much so other may have views...

xarray/core/dataarray.py Show resolved Hide resolved
Comment on lines +6943 to +6944
limit_direction: {"forward", "backward", "both"}, default: "forward"
Consecutive NaNs will be filled in this direction.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we feel about this param vs. the ffill / bfill on the returned object? Do we need both?

Copy link
Contributor Author

@Ockenfuss Ockenfuss Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is actually useful to have this parameter and allow to set it independently of the filing function. Then, for example, you can do things like:

n = np.nan
test=xr.DataArray([1,1, n, n, n, 2, 2], dims=["x"])
test.fill_gaps('x', limit=1, limit_direction='backward').ffill('x')

which effectively pulls the gap left edge towards the gap right edge:

<xarray.DataArray (x: 7)> Size: 56B
array([ 1.,  1., nan, nan,  1.,  2.,  2.])
Dimensions without coordinates: x

I think there might be use cases for this, like constructing masks or advanced indexing arrays maybe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, in my (probably naive) mental model of filling, this wouldn't work like this. How would we explain in words why we get the additional 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point... My view on this:

  • fill_gaps() decides which values to fill (here 1 step backward, i.e. the right edge of every gap)
  • The following function decides how they are filled (ffill here)

Both are two independent steps.
Maybe it is clearer to use 'left'/'right' instead of 'forward'/'backward' terminology?

limit_side: {"left", "right", "both"}, default: "both"
    Whether to limit the filling to the left, right, or both sides of the gap
     - "left": Fill only the left (lower index) side of each gap.
     - "right": Fill only the right (higher index) side of each gap.
     - "both": Fill from both sides of each gap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also join this Wednesday developer meeting to get feedback on naming/API things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I personally think that's quite confusing, but 50% chance it's just me.

Do others have a view?

I'm thinking that filling is simpler than this: it starts at the edge of a gap — and then fills the gap from the edge, which is by definition backwards or forwards (because it's already at an edge so can only go one direction)...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(yes re Wednesday calls! Unfortunately it clashes with something I have so I find it tough to go, but others will be there!)

xarray/core/dataset.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
| datetime.timedelta
) = None,
limit_direction: LimitDirectionOptions = "both",
limit_area: LimitAreaOptions | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine-y but:

  • are interpolate and extrapolate clearer options?
  • is there a better name for the param? I'm not sure. "area" sounds 2D to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I adopted this nomenclature from pandas. What about limit_region, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...

xarray/core/dataset.py Outdated Show resolved Hide resolved
@@ -3766,6 +3769,160 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self:

return bfill(self, dim, limit=limit)

def fill_gaps(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think .fill could be quite nice, do others have a view?

Copy link
Contributor

@dcherian dcherian Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe gap_filler instead, since this method does not actually fill the gaps.

I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later

mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)

Copy link
Contributor Author

@Ockenfuss Ockenfuss Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points! Just to explain:

gap_filler emphasizes the returned object type nicely! However, I choose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.

The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:

mask=ds.gap_filler(...).mask

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...)

xarray/core/dataarray.py Outdated Show resolved Hide resolved
@Ockenfuss
Copy link
Contributor Author

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

@keewis keewis linked an issue Aug 26, 2024 that may be closed by this pull request
@headtr1ck
Copy link
Collaborator

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

I'm currently on holiday and it's difficult to debug these errors on the phone.
I can do it in 1-2 weeks or someone else can help out?

self.content = content
self.mask = mask
self.dim = dim
self.dim = dim
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.dim = dim

# Calculate individual masks
masks = []
if need_limit_mask:
masks.append(
Copy link
Collaborator

@headtr1ck headtr1ck Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is difficult to tell mypy that limit cannot be a Hashable here do to the dynamic way of how it is handled.
Maybe the best course of action is to add an assertion (not sure which exactly, maybe assert not utils.is_hashable(limit), but maybe this is too restrictive?)
Easier might be to add a #type: ignore[arg-type] and a little comment as to why.

@max-sixty
Copy link
Collaborator

max-sixty commented Sep 10, 2024

There are a couple of outstanding questions about the interfaces; in particular:

Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those)

What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...)

@Ockenfuss
Copy link
Contributor Author

Ockenfuss commented Sep 11, 2024

One more naming decision:

  • Fill gaps limited 7665 #9402 (comment)
    I am on conference this week, but mostly available the next two weeks, if anyone wants to discuss. Otherwise, I will join the next dev meeting on Sep. 25.

@max-sixty
Copy link
Collaborator

Did you manage to discuss this this morning?

@kmuehlbauer
Copy link
Contributor

Did you manage to discuss this this morning?

@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants