
Refactor of to_dataset_dict() #47

Open
nocollier opened this issue Apr 25, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@nocollier
Member

My implementation of to_dataset_dict() has become ugly and complicated. In particular, as we think about adding streaming options and integrating intake-esm catalogs, we should refactor.

@nocollier
Member Author

I need to catch up on other projects, but I think we should try to just make a flowchart of what my to_dataset_dict() does, what intake-esm's does, and then how we want it to work.

@mgrover1
Member

Same here - I can draft up a diagram of what intake-esm does and we can revisit!

@nocollier
Member Author

```mermaid
flowchart TD
  A[Did you know]
  B[that this works?]
  A --> B
```

@mgrover1
Member

mind = blown! This is slick.

@nocollier nocollier added the enhancement New feature or request label Apr 30, 2024
@nocollier
Member Author

nocollier commented Jul 23, 2024

@mgrover1 Take a read and let's have a chat in the next few days if you can?

I am close to being able to get on this rework. I wanted to take a stab at organizing my thoughts on what needs to happen. Currently, to_dataset_dict() tries to do too much at once and is therefore hard to change. I think we need to refactor into a few functions (a sketch follows this list):

  • get_file_info(): Gather the information about all of the files associated with all of the datasets in all locations. This is currently part of to_dataset_dict() and should be its own function. I do not think that users need to call this, but they may want to.
  • form_dict_keys(): The dataset dictionary returned by to_dataset_dict() needs unique keys, which we take from the unique facets of the datasets. I had hoped that the ESGF system could tell me which facets (and their order) define unique datasets, but this information is not uniformly part of the database records. This is supposedly something that the controlled vocabulary work can tell me, but I have not figured out how yet. I also doubt that they have it set up for all projects.
    • There is an annoyance in how CMIP5 was published that we can fix at this level (I do this currently in intake-esgf). Their dataset_ids are defined at the level of the table and not the variable. You can fix this by manually altering the dataset_id, adding the variable to it. You also then need to hijack the key-forming process to ensure files get associated with the fixed dataset_id.
    • I personally dislike having to type a lot when referencing keys. Thus my default in intake-esgf is to only include facets which have different values across the datasets being returned in a call of to_dataset_dict(). I also include an ignore_facet keyword that lets you manually exclude any facet you want. Sometimes you have different tables, but they are not really needed to define a unique dataset in your set of datasets.
    • Given that ESGF has decided to not immediately support anything but CMIP6, CMIP6Plus, and CMIP7, I think that we can hard code which facets define unique datasets. I will keep support for CMIP3 and CMIP5 for now, but I am not going to worry about supporting everything abstractly.
    • I do not think that this is a function that a user needs to call, but having it will help separate concerns.
  • to_{file|path}_dict(): Some users may not want (or be able) to buy into the xarray paradigm and our package forces them to use xarray objects needlessly. For example, in my current version of ILAMB I have not been able to complete a rework to use xarray objects. To make use of intake-esgf, I just need file paths. I am not sure what to call the function.
    • In my mind, this function would do the actual downloading/loading and/or pass the handle as needed. It would be used by to_dataset_dict().
    • I envision an option like prefer_streaming={True|"zarr"|"opendap"}. There are many link types in the records and we are looking at supporting VirtualZarr. Also, if we can provide an index type that can load the intake-esm catalogs, then we can harvest the handles and return those as well.
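
To make the proposed split concrete, here is a minimal sketch of how the pieces might fit together. The function names come from the list above, but the signatures, the default facet list, and the minimal-facet key logic are illustrative assumptions rather than a final design.

```python
# A minimal sketch of the proposed split, assuming a pandas DataFrame of
# file records (one row per file, with a column per facet plus access URLs).
# Signatures and helper names are illustrative, not the final API.
from __future__ import annotations

import pandas as pd


def get_file_info(dataset_ids: list[str]) -> pd.DataFrame:
    """Query the index nodes for file-level records of the given datasets."""
    raise NotImplementedError  # would wrap the search logic now inside to_dataset_dict()


def form_dict_keys(
    df: pd.DataFrame, facets: list[str], ignore_facets: list[str] | None = None
) -> pd.Series:
    """Build minimal unique keys from only the facets that vary across datasets."""
    ignore_facets = ignore_facets or []
    keep = [f for f in facets if f not in ignore_facets and df[f].nunique() > 1]
    keep = keep or facets  # fall back to all facets if only one dataset was found
    return df[keep].astype(str).apply(".".join, axis=1)


def to_path_dict(
    df: pd.DataFrame, keys: pd.Series, prefer_streaming: bool | str = False
) -> dict[str, list[str]]:
    """Download files (or resolve streaming handles) and group the paths by key."""
    raise NotImplementedError  # would hold the download/streaming logic


def to_dataset_dict(dataset_ids: list[str], prefer_streaming: bool | str = False) -> dict:
    """Thin wrapper: file info -> keys -> paths/handles -> xarray datasets."""
    df = get_file_info(dataset_ids)
    keys = form_dict_keys(df, facets=["source_id", "member_id", "variable_id", "grid_label"])
    paths = to_path_dict(df, keys, prefer_streaming=prefer_streaming)
    # ...open each group of paths with xarray and return {key: xr.Dataset}...
    return paths
```

With a split like this, the CMIP5 dataset_id patch could live entirely inside the key-forming step without touching the download code.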

Other issues:

  • What should happen when a download fails (that is, we exhaust all possible download locations)? This could also happen if one file of several defining a dataset fails to download.
    1. We could raise an exception for the entire to_dataset_dict() call. If you were depending on the entire catalog, you may want your script to fail if you do not get it all. Perhaps a custom exception CatalogDownloadIncomplete()?
    2. We could return the dictionary quietly (always logging) without the dataset with the missing file, in the spirit of giving the user the best we can without failing.
    3. We could return the dictionary quietly, but with the key of the dataset missing a file resolving to an exception, DatasetLoadError(). Is it ok to return an exception? Is that pythonic? Users would need to know that keys can resolve to an exception. (A sketch of options 1 and 3 follows this list.)
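
For reference, a rough sketch of how these options could look in code. Both exception classes are the hypothetical names floated above, and the on_error policy keyword is an assumption, not an existing option.

```python
# A rough sketch of the failure-handling options above. The exception classes
# are hypothetical names from this discussion, and the on_error keyword is an
# assumed way to let the user pick a policy.
import logging


class CatalogDownloadIncomplete(Exception):
    """Option 1: raised when any requested dataset cannot be fully downloaded."""


class DatasetLoadError(Exception):
    """Option 3: stored as the value for a key whose dataset could not be loaded."""


def finalize(datasets: dict, failures: dict[str, Exception], on_error: str = "warn") -> dict:
    """Combine successfully loaded datasets with failures according to a policy."""
    if not failures:
        return datasets
    if on_error == "raise":  # option 1: fail the whole to_dataset_dict() call
        raise CatalogDownloadIncomplete(f"could not load: {sorted(failures)}")
    if on_error == "keep":  # option 3: keys resolve to exceptions
        return {**datasets, **{key: DatasetLoadError(str(err)) for key, err in failures.items()}}
    # option 2: log the failures and return only what we could load
    logging.getLogger("intake_esgf").warning("dropping incomplete datasets: %s", sorted(failures))
    return datasets
```

With option 3, users would need to check isinstance(value, Exception) (or use the value inside a try/except) before treating it as a dataset, which is the main argument against it.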

@mgrover1
Member

@nocollier - are you available to chat tomorrow? I am free ~10 AM central.

@nocollier
Member Author

nocollier commented Jul 24, 2024 via email

@nocollier
Member Author

Another problem with ESGF data that to_dataset_dict() needs to solve:

There is some inconsistency in the usage of variant_label and member_id. I thought they were synonyms, but in fact they can be different (see this document). In short, the member_id includes a sub_experiment_id if one is present. See this excerpt from that document:

	if sub_experiment_id = “none”
		member_id = <variant_label>
	else
		member_id = <sub_experiment_id>-<variant_label>
	endif

The problem is that the master_id contains the member_id facet, but member_id is not part of the global attributes of the files (only variant_label is). Since we have all of this information in the pandas dataframe, we should ensure that the global attributes of our returned datasets contain all facets which are part of the master_id (a sketch follows).
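
A small sketch of what that fix could look like; the helper names are hypothetical, and the dataframe columns and attribute names simply mirror the facets discussed above.

```python
# A sketch of deriving member_id per the logic quoted above and back-filling
# master_id facets into a returned dataset's global attributes. The helpers
# are illustrative; the dataframe row is assumed to carry one value per facet.
import pandas as pd
import xarray as xr


def build_member_id(sub_experiment_id: str | None, variant_label: str) -> str:
    """member_id is the variant_label, prefixed by sub_experiment_id when present."""
    if sub_experiment_id in (None, "", "none"):
        return variant_label
    return f"{sub_experiment_id}-{variant_label}"


def fill_master_id_facets(ds: xr.Dataset, row: pd.Series, facets: list[str]) -> xr.Dataset:
    """Ensure every facet appearing in the master_id is present as a global attribute."""
    for facet in facets:
        if facet not in ds.attrs and facet in row:
            ds.attrs[facet] = row[facet]
    ds.attrs.setdefault(
        "member_id",
        build_member_id(row.get("sub_experiment_id", "none"), row["variant_label"]),
    )
    return ds
```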
