
Refactor of to_dataset_dict() #47

Open
nocollier opened this issue Apr 25, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@nocollier
Member

My implementation of to_dataset_dict() has become ugly and complicated. In particular, as we think about adding streaming options and integrating intake-esm catalogs, we should refactor.

@nocollier
Member Author

I need to catch up on other projects, but I think we should try to just make a flowchart of what my to_dataset_dict() does, what intake-esm's does, and then how we want it to work.

@mgrover1
Member

Same here - I can draft up a diagram of what intake-esm does and we can revisit!

@nocollier
Member Author

```mermaid
flowchart TD
  A[Did you know]
  B[that this works?]
  A --> B
```

@mgrover1
Member

mind = blown! This is slick.

@nocollier nocollier added the enhancement New feature or request label Apr 30, 2024
@nocollier
Member Author

nocollier commented Jul 23, 2024

@mgrover1 Take a read and let's have a chat in the next few days if you can?

I am close to being able to get on this rework. I wanted to take a stab at organizing my thoughts on what needs to happen. Currently, to_dataset_dict() tries to do too much at once and is therefore hard to change. I think we need to refactor into a few functions (a sketch follows this list):

  • get_file_info(): Gather the information about all of the files associated with all of the datasets in all locations. This is currently part of to_dataset_dict() and should be its own function. I do not think that users need to call this, but they may want to.
  • form_dict_keys(): The dataset dictionary returned by to_dataset_dict() needs unique keys, which we take from the unique facets of the datasets. I had hoped that the ESGF system could tell me which facets (and their order) define unique datasets, but this information is not uniformly part of the database records. This is supposedly something that the controlled vocabulary work can tell me, but I have not figured out how yet. I also doubt that they have it set up for all projects.
    • There is an annoyance in how CMIP5 was published that we can fix at this level (I do this currently in intake-esgf). Their dataset_ids are defined at the level of the table and not the variable. You can fix this by manually altering the dataset_id, adding the variable to it. You also then need to hijack the key-forming process to ensure files get associated with the fixed dataset_id.
    • I personally dislike having to type a lot when referencing keys. Thus my default in intake-esgf is to only include facets which have different values across the datasets being returned in a call of to_dataset_dict(). I also include an ignore_facet keyword that lets you manually exclude any facet you want. Sometimes you have different tables, but they are not really needed to define a unique dataset in your set of datasets.
    • Given that ESGF has decided to not immediately support anything but CMIP6, CMIP6Plus, and CMIP7, I think that we can hard code which facets define unique datasets. I will keep support for CMIP3 and CMIP5 for now, but I am not going to worry about supporting everything abstractly.
    • I do not think that this is a function that a user needs to call, but having it will help separate concerns.
  • to_{file|path}_dict(): Some users may not want (or be able) to buy into the xarray paradigm and our package forces them to use xarray objects needlessly. For example, in my current version of ILAMB I have not been able to complete a rework to use xarray objects. To make use of intake-esgf, I just need file paths. I am not sure what to call the function.
    • In my mind, this function would do the actual downloading/loading and/or pass the handle as needed. It would be used by to_dataset_dict().
    • I envision an option like prefer_streaming={True|"zarr"|"opendap"}. There are many link types in the records and we are looking at supporting VirtualZarr. Also, if we can provide an index type that can load the intake-esm catalogs, then we can harvest the handles and return those as well.
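
To make the proposed split concrete, here is a minimal sketch of how the pieces might fit together. The function names come from the list above, but the signatures, the default facet list, and the minimal-facet key logic are illustrative assumptions rather than a final design.

```python
# A minimal sketch of the proposed split, assuming a pandas DataFrame of
# file records (one row per file, with a column per facet plus access URLs).
# Signatures and helper names are illustrative, not the final API.
from __future__ import annotations

import pandas as pd


def get_file_info(dataset_ids: list[str]) -> pd.DataFrame:
    """Query the index nodes for file-level records of the given datasets."""
    raise NotImplementedError  # would wrap the search logic now inside to_dataset_dict()


def form_dict_keys(
    df: pd.DataFrame, facets: list[str], ignore_facets: list[str] | None = None
) -> pd.Series:
    """Build minimal unique keys from only the facets that vary across datasets."""
    ignore_facets = ignore_facets or []
    keep = [f for f in facets if f not in ignore_facets and df[f].nunique() > 1]
    keep = keep or facets  # fall back to all facets if only one dataset was found
    return df[keep].astype(str).apply(".".join, axis=1)


def to_path_dict(
    df: pd.DataFrame, keys: pd.Series, prefer_streaming: bool | str = False
) -> dict[str, list[str]]:
    """Download files (or resolve streaming handles) and group the paths by key."""
    raise NotImplementedError  # would hold the download/streaming logic


def to_dataset_dict(dataset_ids: list[str], prefer_streaming: bool | str = False) -> dict:
    """Thin wrapper: file info -> keys -> paths/handles -> xarray datasets."""
    df = get_file_info(dataset_ids)
    keys = form_dict_keys(df, facets=["source_id", "member_id", "variable_id", "grid_label"])
    paths = to_path_dict(df, keys, prefer_streaming=prefer_streaming)
    # ...open each group of paths with xarray and return {key: xr.Dataset}...
    return paths
```

With a split like this, the CMIP5 dataset_id patch could live entirely inside the key-forming step without touching the download code.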

Other issues:

  • What should happen when a download fails (that is, we exhaust all possible download locations)? This could also happen if one file of several defining a dataset fails to download.
    1. We could raise an exception for the entire to_dataset_dict() call. If you were depending on the entire catalog, you may want your script to fail if you do not get it all. Perhaps a custom exception CatalogDownloadIncomplete()?
    2. We could return the dictionary quietly (always logging) without the dataset with the missing file, in the spirit of giving the user the best we can without failing.
    3. We could return the dictionary quietly, but with the key of the dataset missing a file resolving to an exception, DatasetLoadError(). Is it ok to return an exception? Is that pythonic? Users would need to know that keys can resolve to an exception. (A sketch of options 1 and 3 follows this list.)
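
For reference, a rough sketch of how these options could look in code. Both exception classes are the hypothetical names floated above, and the on_error policy keyword is an assumption, not an existing option.

```python
# A rough sketch of the failure-handling options above. The exception classes
# are hypothetical names from this discussion, and the on_error keyword is an
# assumed way to let the user pick a policy.
import logging


class CatalogDownloadIncomplete(Exception):
    """Option 1: raised when any requested dataset cannot be fully downloaded."""


class DatasetLoadError(Exception):
    """Option 3: stored as the value for a key whose dataset could not be loaded."""


def finalize(datasets: dict, failures: dict[str, Exception], on_error: str = "warn") -> dict:
    """Combine successfully loaded datasets with failures according to a policy."""
    if not failures:
        return datasets
    if on_error == "raise":  # option 1: fail the whole to_dataset_dict() call
        raise CatalogDownloadIncomplete(f"could not load: {sorted(failures)}")
    if on_error == "keep":  # option 3: keys resolve to exceptions
        return {**datasets, **{key: DatasetLoadError(str(err)) for key, err in failures.items()}}
    # option 2: log the failures and return only what we could load
    logging.getLogger("intake_esgf").warning("dropping incomplete datasets: %s", sorted(failures))
    return datasets
```

With option 3, users would need to check isinstance(value, Exception) (or use the value inside a try/except) before treating it as a dataset, which is the main argument against it.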

@mgrover1
Member

@nocollier - are you available to chat tomorrow? I am free ~10 AM central.

@nocollier
Member Author

nocollier commented Jul 24, 2024 via email

@nocollier
Member Author

Another problem with ESGF data that to_dataset_dict() needs to solve:

There is some inconsistency in the usage of variant_label and member_id. I thought they were synonyms, but in fact they can be different (see this document). In short, the member_id includes a sub_experiment_id if one is present. See this excerpt from that document:

	if sub_experiment_id = “none”
		member_id = <variant_label>
	else
		member_id = <sub_experiment_id>-<variant_label>
	endif

The problem is that the master_id contains the member_id facet, but member_id is not part of the global attributes of the files (only variant_label is). Since we have all of this information in the pandas dataframe, we should ensure that the global attributes of our returned datasets contain all facets which are part of the master_id (a sketch follows).
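
A small sketch of what that fix could look like; the helper names are hypothetical, and the dataframe columns and attribute names simply mirror the facets discussed above.

```python
# A sketch of deriving member_id per the logic quoted above and back-filling
# master_id facets into a returned dataset's global attributes. The helpers
# are illustrative; the dataframe row is assumed to carry one value per facet.
import pandas as pd
import xarray as xr


def build_member_id(sub_experiment_id: str | None, variant_label: str) -> str:
    """member_id is the variant_label, prefixed by sub_experiment_id when present."""
    if sub_experiment_id in (None, "", "none"):
        return variant_label
    return f"{sub_experiment_id}-{variant_label}"


def fill_master_id_facets(ds: xr.Dataset, row: pd.Series, facets: list[str]) -> xr.Dataset:
    """Ensure every facet appearing in the master_id is present as a global attribute."""
    for facet in facets:
        if facet not in ds.attrs and facet in row:
            ds.attrs[facet] = row[facet]
    ds.attrs.setdefault(
        "member_id",
        build_member_id(row.get("sub_experiment_id", "none"), row["variant_label"]),
    )
    return ds
```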
