Load `esmvalcore.dataset.Dataset` objects in parallel using Dask #2517

bouweandela · 2024-09-09T07:52:13Z

Description

Load the individual files in a dataset in parallel using Dask and add the option to get a dask.delayed.Delayed back from esmvalcore.dataset.Dataset that can be fed to dask.compute to get an iris.cube.Cube. This can considerably speed up loading datasets that consist of many files or, when used with the delayed option, speed up loading multiple datasets.
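As a rough illustration of the pattern described above, the sketch below builds one delayed task per file, combines them into a single `dask.delayed.Delayed` per dataset, and passes several of those to `dask.compute` at once. It assumes nothing about the actual ESMValCore implementation: `load_file`, `concatenate` and `load_dataset` are hypothetical stand-ins, and plain lists take the place of `iris.cube.Cube` objects.

```python
import dask
from dask.delayed import Delayed


def load_file(path: str) -> list[int]:
    """Hypothetical stand-in for reading one file; returns fake data."""
    return [len(path)] * 3


def concatenate(chunks: list[list[int]]) -> list[int]:
    """Hypothetical stand-in for joining per-file results into one object."""
    return [value for chunk in chunks for value in chunk]


def load_dataset(paths: list[str]) -> Delayed:
    """Build a lazy task graph; nothing is read until it is computed."""
    per_file = [dask.delayed(load_file)(path) for path in paths]
    return dask.delayed(concatenate)(per_file)


# Two hypothetical datasets, each consisting of one or more files.
lazy_a = load_dataset(["a_1850.nc", "a_1900.nc"])
lazy_b = load_dataset(["b_1850.nc"])

# A single compute call evaluates both graphs, so the per-file tasks of
# both datasets can be run concurrently by the active Dask scheduler.
result_a, result_b = dask.compute(lazy_a, lazy_b)
```

Returning the delayed object instead of the computed result is what lets a caller collect many datasets first and hand them to the scheduler together.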

Related to #2300 and #2316

Link to documentation: https://esmvaltool--2517.org.readthedocs.build/projects/ESMValCore/en/2517/api/esmvalcore.dataset.html#esmvalcore.dataset.Dataset.load

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 ~~Changes are backward compatible~~ The preprocessor function fix_metadata no longer groups by the attribute "source_file".
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2024-09-09T07:58:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.67%. Comparing base (a328578) to head (52cc1ed).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2517   +/-   ##
=======================================
  Coverage   94.66%   94.67%           
=======================================
  Files         251      251           
  Lines       14287    14297   +10     
=======================================
+ Hits        13525    13535   +10     
  Misses        762      762


valeriupredoi

this is brilliant, bud! I've been meaning to get delayed in places in Core for some time. Got one possible nagging comment though: the Dask best-practices page at https://docs.dask.org/en/stable/delayed-best-practices.html says "Every delayed task has an overhead of a few hundred microseconds. Usually this is ok, but it can become a problem if you apply dask.delayed too finely. In this case, it’s often best to break up your many tasks into batches or use one of the Dask collections to help you." I am guessing this only starts to matter at O(millions) of tasks (at least), but can we maybe run a test with one of those mega recipes that loads hundreds of datasets?
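To put the quoted overhead into numbers, here is a back-of-the-envelope sketch in plain Python (no Dask involved). The figure of 300 microseconds per task is an assumption chosen from the "few hundred microseconds" range quoted above, and the recipe size of 500 datasets with 100 files each is hypothetical, not a measurement from any real recipe.

```python
import math

# Assumption: 300 microseconds of scheduling overhead per delayed task.
OVERHEAD_PER_TASK_S = 300e-6


def scheduling_overhead(n_tasks: int) -> float:
    """Estimated total overhead in seconds for n_tasks delayed tasks."""
    return n_tasks * OVERHEAD_PER_TASK_S


def batch(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size."""
    return [
        items[start:start + batch_size]
        for start in range(0, len(items), batch_size)
    ]


# Hypothetical large recipe: 500 datasets of 100 files, one task per file.
files = list(range(500 * 100))
per_file_overhead = scheduling_overhead(len(files))

# Batching 100 files into each task reduces the number of tasks to 500.
batches = batch(files, 100)
batched_overhead = scheduling_overhead(len(batches))

# For comparison: the O(millions) regime mentioned in the comment.
million_task_overhead = scheduling_overhead(1_000_000)

print(f"one task per file: {per_file_overhead:.2f} s")
print(f"one task per batch: {batched_overhead:.2f} s")
print(f"one million tasks: {million_task_overhead:.2f} s")
```

Under these assumptions the overhead stays in the range of seconds for tens of thousands of tasks and only reaches minutes around a million tasks, which is why a test with a large recipe is a reasonable way to check the concern.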

valeriupredoi · 2024-09-20T12:48:21Z

oh and maybe a line or two in the documentation perhaps? Bit of an advanced topic, so maybe a very short reference

bouweandela · 2024-10-23T14:54:58Z

This may need a bit more testing. The recipe below fails with the distributed scheduler and the iris main branch.

# ESMValTool
# recipe_python.yml
#
# See https://docs.esmvaltool.org/en/latest/recipes/recipe_examples.html
# for a description of this recipe.
#
# See https://docs.esmvaltool.org/projects/esmvalcore/en/latest/recipe/overview.html
# for a description of the recipe format.
---
documentation:
  description: |
    Example recipe that plots a map and timeseries of temperature.

  title: Recipe that runs an example diagnostic written in Python.

  authors:
    - andela_bouwe
    - righi_mattia

  maintainer:
    - schlund_manuel

  references:
    - acknow_project

  projects:
    - esmval
    - c3s-magic

datasets:
  - {dataset: FGOALS-f3-L, ensemble: 'r1i1p1f1', grid: gn}

preprocessors:
  # See https://docs.esmvaltool.org/projects/esmvalcore/en/latest/recipe/preprocessor.html
  # for a description of the preprocessor functions.

  annual_mean_global:
    area_statistics:
      operator: mean
    annual_statistics:
      operator: mean
    convert_units:
      units: degrees_C

diagnostics:

  timeseries:
    description: Annual mean temperature in Amsterdam and global mean since 1850.
    themes:
      - phys
    realms:
      - atmos
    variables:
      tos_global:
        short_name: tos
        mip: Omon
        project: CMIP6
        exp: [historical, ssp585]
        preprocessor: annual_mean_global
        timerange: 1850/2100
        caption: Annual global mean {long_name} according to {dataset}.
    scripts:
      script1:
        script: examples/diagnostic.py
        quickplot:
          plot_type: plot

bouweandela · 2024-10-24T08:19:33Z

The issue mentioned above is fixed by SciTools/iris#6187.

schlunma · 2024-10-30T17:15:42Z

esmvalcore/dataset.py

@@ -765,6 +798,51 @@ def _load(self) -> Cube:
            **self.facets,
        }
        settings["concatenate"] = {"check_level": self.session["check_level"]}
+
+        result = []
+        for input_file in input_files:


This changes how data is passed through the different preprocessor functions, doesn't it?

Right now, for example, fix_metadata will get ALL cubes from ALL files as input. With this change here, it will only get the cubes from one file, right?

I know that fix_metadata itself groups by file, but this is already very problematic (see #1806 and #2551).

I also fear that this might have other undesired side effects. Why do you need to treat these first preprocessor functions differently in the new code?

bouweandela · 2024-11-14

> Why do you need to treat these first preprocessor functions differently in the new code?

To improve parallelism. Like this, each input file can be loaded and preprocessed up to the concatenate step in parallel.

> This changes how data is passed through the different preprocessor functions, doesn't it?

No, it just takes the grouping out of fix_metadata and implements it in the function calling fix_metadata to enable additional parallelism. If this pull request is merged, #2551 would need to be updated to do the grouping here instead of inside fix_metadata.
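The point about moving the grouping can be sketched as follows. This is not the ESMValCore code: cubes are modelled as plain dictionaries, `fix_metadata` is a hypothetical stand-in that only tags its input, and the two wrapper functions exist purely to contrast grouping inside the fix step with grouping in the caller.

```python
from collections import defaultdict


def fix_metadata(cubes: list[dict]) -> list[dict]:
    """Hypothetical stand-in for a fix step; tags every cube as fixed."""
    return [{**cube, "fixed": True} for cube in cubes]


def fix_grouped_inside(cubes: list[dict]) -> list[dict]:
    """Before: receive all cubes and group them by 'source_file' here."""
    by_file = defaultdict(list)
    for cube in cubes:
        by_file[cube["source_file"]].append(cube)
    result = []
    for group in by_file.values():
        result.extend(fix_metadata(group))
    return result


def fix_per_file(cubes_per_file: dict[str, list[dict]]) -> list[dict]:
    """After: the caller loops over files and calls the fix once per file."""
    result = []
    for cubes in cubes_per_file.values():
        result.extend(fix_metadata(cubes))
    return result


# Hypothetical input: two files, the second one containing two cubes.
cubes_per_file = {
    "tos_1850.nc": [{"var": "tos", "source_file": "tos_1850.nc"}],
    "tos_1900.nc": [
        {"var": "tos", "source_file": "tos_1900.nc"},
        {"var": "areacello", "source_file": "tos_1900.nc"},
    ],
}
all_cubes = [cube for cubes in cubes_per_file.values() for cube in cubes]

before = fix_grouped_inside(all_cubes)
after = fix_per_file(cubes_per_file)
```

Both variants hand the fix step the same per-file groups, so the results match. The difference is that in the second form each per-file call is independent of the others, which is what makes it possible to run them as separate parallel tasks.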

schlunma · 2024-11-14

Okay, I think I misunderstood the code in the first place. The function preprocess is not at all straightforward when it comes to handling of input and output types... I agree that the behavior has not changed.

I will test this with a couple of recipes once Levante is running again next week. In the meantime, would it make sense to remove the grouping of files in fix_metadata? It would be confusing to have this in two places of the code. I know that this wouldn't be strictly backwards-compatible, but the grouping was only enabled if the cubes have a source_file attribute (which is probably only the case when used within ESMValTool). I highly doubt that this function would be very useful outside of ESMValTool anyway.

bouweandela · 2024-11-15

I removed the grouping in d5a39af, but where would you suggest we remove the "source_file" attribute now? Apart from grouping, it is also used to generate error messages from the cmor checkers. Should it be removed after cmor_check_data?

schlunma · 2024-11-15

Good question. I would either remove it after cmor_check_data or remove it altogether from the code. The preprocessors log all filenames anyway now, so it's not as important anymore as it used to be.

schlunma

Sorry, I fear that this will break existing recipes due to changes to the preprocessing pipeline. Will remove this block once resolved.

esmvalcore/dataset.py

bouweandela · 2024-11-14T08:15:27Z

> Sorry, I fear that this will break existing recipes due to changes to the preprocessing pipeline. Will remove this block once resolved.

Thanks for reviewing @schlunma! As far as I can see it does not change the preprocessing pipeline, but maybe you can find a case where it does? Maybe you could run a recipe that you think could potentially be broken as a test and report back the result?

bouweandela added the dask (related to improvements using Dask) label Sep 9, 2024

bouweandela added 2 commits September 9, 2024 14:13

Parallel load datasets

4649d87

Add test and improve documentation

bc889ba

bouweandela force-pushed the parallel-load branch from 56d24a9 to bc889ba September 9, 2024 12:13

bouweandela marked this pull request as ready for review September 11, 2024 07:15

This was referenced Sep 12, 2024

Performance improvement: recipe_extremes_wind_3h.yml #2301

Open

Performance improvement: recipe_easy_ipcc.yml #2300

Open

Merge branch 'main' into parallel-load

a21996d

valeriupredoi approved these changes Sep 20, 2024

bouweandela added 5 commits September 26, 2024 21:39

Use ruff formatting

0bd3da1

Merge remote-tracking branch 'origin/main' into parallel-load

7e232a3

Add type hint

4057950

Merge branch 'main' of github.com:ESMValGroup/ESMValCore into paralle…

d19ce4d

…l-load

Merge branch 'main' into parallel-load

078b5bf

Mark preprocessor functions that modify input as not pure

81b41a0

schlunma reviewed Oct 30, 2024

schlunma requested changes Oct 30, 2024

schlunma reviewed Oct 31, 2024

Remove grouping by file inside fix_metadata

d5a39af

bouweandela added the backwards incompatible change label Nov 15, 2024

Merge branch 'main' of github.com:ESMValGroup/ESMValCore into paralle…

52cc1ed

…l-load
