Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADD: a doc page with some history/perspective and FAQ #76

Merged
merged 5 commits into from
Nov 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions doc/faq.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
---
jupytext:
text_representation:
format_name: myst
kernelspec:
display_name: Python 3
name: python3
---

# Frequently Asked Questions

###### Why aren't the results I get with intake-esgf the same as Metagrid?

As of this writing (November 2024) intake-esgf is configured to, by default, use Globus/ElasticSearch-based indices we are developing which contain a lot of the global ESGF holdings but not all of them. We provide backward compatibility to the old Solr-based nodes which Metagrid is using. You just need to [configure](configure) intake-esgf to use them.

Note that another reason for asymmetrical results is that an index may have failed to return a response. In intake-esgf, we print a warning to the screen when this happens, but Metagrid will report results with no warning. This can give you the impression that an error has occurred when really it is due to undependable indices. The federation is currently working hard to improve this situation in time for CMIP7.

###### Are your catalogs CSV/JSON files?

No, intake-esgf catalogs are initially empty and are populated only when you perform a faceted search by communicating with remote index nodes. The intake-esgf package can be [configured](configure) to dynamically query a number of ESGF-developed index nodes. At the moment these include some Globus/ElasticSearch-based indices as well as the old Solr-based technologies.

###### What projects does intake-esgf support?

You can check the projects which are currently supported:

```{code-cell}
from intake_esgf import supported_projects
print(supported_projects())
```

In principle, intake-esgf can be made to support any project for which we have index data. It is a matter of implementing some boilerplate code that tells intake-esgf how certain facets behave (model names, ensemble members, etc.). If you have a need, raise an [issue](https://github.com/esgf2-us/intake-esgf/issues).
2 changes: 2 additions & 0 deletions doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,9 @@ or through ``conda-forge``
:caption: User Guide

beginner
intake
quickstart
faq

.. toctree::
:maxdepth: 2
Expand Down
38 changes: 38 additions & 0 deletions doc/intake.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Why intake?

This package bears the name *intake* because of 2 python packages that are in use in the broader data and Earth systeming community:

- [intake](https://github.com/intake/intake): a lightweight package for finding, investigating, loading and disseminating data.
- [intake-esm](https://github.com/intake/intake-esm): an intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.

And while we certainly follow in their footsteps, you may notice that this package is not an intake plugin nor does it depend on either.

## intake

Formal intake catalogs are [meant](https://intake.readthedocs.io/en/latest/scope2.html#motivation) to describe the data that is available and free you from worrying about how to obtain it or get it into memory, no matter the storage format. This paradigm leaves you with concise data access code that is descriptive enough that it can be left in scripts.

In the case of ESGF data, static intake catalogs do not make sense. We have millions of datasets making the static catalogs far too large to be practical. Furthermore, ESGF data is already described in great detail in databases that can be queried using a web [interface](https://esgf-node.ornl.gov/search) or a [RESTful API](https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html). Even the intake documentation provides the following reason to not use intake: *all your data needs are already met by some other data service by itself, for example a set of tables/procedures/views/queries on a SQL database.*

However, while it may not make sense to create formal, static intake catalogs for all of the ESGF data, the simplicity of data access is a feature we would like to provide ESGF users.

## intake-esm

[intake-esm](https://github.com/intake/intake-esm) was created as an intake plugin that provides static indices for *portions* of the ESGF holdings that are hosted outside of the federation. The community has come up with many use [cases](https://intake-esm.readthedocs.io/en/stable/reference/faq.html#is-there-a-list-of-existing-catalogs) for these types of catalogs:

- Modeling centers downloaded large portions of the ESGF archive to their own storage systems and wanted a way to let their users discover these datasets.
- Portions of the archive were converted to Zarr format and hosted on Google Cloud.
- Modeling centers ran non-CMIP experiments and wanted a method to let users discover and access these datasets.

These catalogs and the intake-esm software was created to address these and other use cases for privately developed catalogs.

In the process, intake-esm made strides in a vastly improved user interface. Users of ESGF data are accustomed to a web-based facetted search with results returned as lists of datasets in a web page that may be browsed and included in a data cart. In intake-esm, the catalogs may be searched in the same way but with results returned as a pandas dataframe (the python standard for representing tabular data). Data download/access is then handled automatically with a simple call to `to_dataset_dict()`.

## intake-esgf

We found this paradigm to be powerful and thus intake-esgf was born out of a desire to provide the same interface to ESGF users with a few important differences.

- Our catalogs initialize empty and are populated *dynamically* by searching a configurable list of indices.
- When searching for file information, we hide the data location, form a composite view of all access methods, and select the fastest option for the user.
- Data transfers only occur if necessary. Before making a transfer we check that the data does not already exist in a cache or other local source or that streaming was not preferred and available. This makes supporting data lakes simple and provides a method to accommodate other data access options in a single interface.

So while we do not explicitly depend on intake or intake-esm, our package is heavily influenced by both of these works. We have decided to keep the name `intake-esgf` despite the potential confusion as an homage to the packages that inspired this work.
9 changes: 8 additions & 1 deletion intake_esgf/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,10 +21,17 @@ def in_notebook() -> bool:
return False # Probably standard Python interpreter


def supported_projects() -> list[str]:
"""What projects are supported?"""
from intake_esgf.projects import projects

return list(projects.keys())


IN_NOTEBOOK = in_notebook()

from intake_esgf.catalog import ESGFCatalog # noqa
from intake_esgf.config import conf # noqa
from intake_esgf._version import __version__ # noqa

__all__ = ["ESGFCatalog", "conf", "IN_NOTEBOOK"]
__all__ = ["ESGFCatalog", "conf", "IN_NOTEBOOK", "supported_projects"]
Loading