Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscellaneous lazy preprocessor improvements #2520

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

bouweandela
Copy link
Member

@bouweandela bouweandela commented Sep 10, 2024

Description

Various improvements to the preprocessor functions related to handling lazy data.

Related to #674

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

Copy link

codecov bot commented Sep 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.66%. Comparing base (9b9a125) to head (904e32a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2520   +/-   ##
=======================================
  Coverage   94.66%   94.66%           
=======================================
  Files         251      251           
  Lines       14287    14296    +9     
=======================================
+ Hits        13525    13534    +9     
  Misses        762      762           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@bouweandela bouweandela added preprocessor Related to the preprocessor dask related to improvements using Dask labels Sep 11, 2024
@bouweandela bouweandela marked this pull request as ready for review October 14, 2024 16:05
@valeriupredoi
Copy link
Contributor

@bouweandela this is a lovely PR - did you want to still work on it? It's been a while since it's been sitting out, let's get it in if you done working on it 🍺

@bouweandela
Copy link
Member Author

bouweandela commented Nov 13, 2024

It's ready for review. I would like to add more tests, but I'm not sure when I'll have time to do that, so maybe it's better to just go ahead.

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @bouweandela - a few very minor points, until the last one - chunks of aux factories/coords - me not happy about that, see me comment 🍺

esmvalcore/preprocessor/_area.py Outdated Show resolved Hide resolved
corresponds to a dimension of *mask* and its value provides
the index in *array* which the dimension of *mask* corresponds
to, so the first element of *dim_map* gives the index of *array*
that corresponds to the first dimension of *mask* etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be a bit more clear:

  • is it a list/tuple of ints? or of lists/tuples?
  • give an example e.g. dim_map = [2, 3, 4] -> array[2] will be masked with mask[0]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copy/pasted from the iris docs. I could also just link that? Anyway, this is not part of our public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha OK, pls post a link then - it's as clear as a foggy November day in the North Atlantic 😁 The module is private, but this seems to be a decent function for our public API? Up to you on that

Copy link
Member Author

@bouweandela bouweandela Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in 904e32a so the docstring now just refers to the iris.util.broadcast_to_shape function. That function has examples: https://scitools-iris.readthedocs.io/en/stable/generated/api/iris.util.html#iris.util.broadcast_to_shape

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module is private, but this seems to be a decent function for our public API?

Let's keep it private for now, I'm not sure if this is the kind of functionality ESMValCore should provide.

esmvalcore/preprocessor/_shared.py Show resolved Hide resolved

mask = iris.util.broadcast_to_shape(
mask, array.shape, dim_map=dim_map, chunks=array_chunks
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to put this in the func docstring: Raises: whatever the iris utility may raise - that utility is about 90% of this function, so I'd put it in the description/API doc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunks + (None,)
)
cube.replace_coord(coord)
return cube
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a costly function - is it worth its cost vs not doing anything and going with default chunks for aux coords? Also, I don't like it when we do things in our code, that should be done by Iris - this is probably something they should do at load point ie make sure the chunks are decent. Am rather not happy about this TBH, if I understand the issue correctly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workaround for SciTools/iris#5457

Not doing anything will result in all workers dying when using the distributed scheduler because they use too much memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh absolutely, I commend your work 🎖️ I am frowning at Iris, bud - when you plop all the other suggestions, I'll Approve, am aware this needs be here, am just not happy it's not at iris

bouweandela and others added 2 commits November 20, 2024 16:39
Co-authored-by: Valeriu Predoi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask related to improvements using Dask preprocessor Related to the preprocessor
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants