ADD: expanded documentation (#33)
nocollier authored Mar 5, 2024
1 parent 76fdb8e commit ff33789
Showing 5 changed files with 150 additions and 7 deletions.
1 change: 1 addition & 0 deletions doc/conf.py
@@ -48,6 +48,7 @@

# Disable cell timeout
nbsphinx_timeout = -1
nb_execution_timeout = -1
nbsphinx_prolog = ""

# The encoding of source files.
6 changes: 3 additions & 3 deletions doc/configure.md
@@ -9,7 +9,7 @@ kernelspec:

# Configuring the `ESGFCatalog`

By default, the ESGFCatalog is configured to point at a Globus-based index (built on [Elasticsearch](https://www.elastic.co/)) with information about holdings at the ALCF (Argonne Leadership Computing Facility). This is a temporary default while we work on redesigning the index.
By default, the ESGFCatalog is configured to point at Globus-based indices (built on [Elasticsearch](https://www.elastic.co/)) with information about holdings at the OLCF (Oak Ridge Leadership Computing Facility) and the ALCF (Argonne Leadership Computing Facility). This is a temporary default while we work on redesigning the index.

```{code-cell}
from intake_esgf import ESGFCatalog
@@ -34,8 +34,8 @@ for ind in cat.indices:

## Setting the local cache

To do. Comment about working on a shared resource where you have a quota.
The location to which we will download ESGF holdings is set to `${HOME}/.esgf` by default. However, you may change this location with a call to `cat.set_local_cache_directory()`. This can be particularly useful if you are working on a shared resource such as an institutional cluster or group workstation. On these machines, your home directory may have a restrictive storage quota, which you can avoid by pointing the cache directory at a shared project space. This has the added benefit that others with read access to your project can use the data. Note that, at this time, you will need to set the local cache in your analysis scripts before downloading or loading any data.
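
A minimal sketch of this in practice (the shared project path is purely illustrative, and we assume the method accepts a path-like argument):

```python
from pathlib import Path

from intake_esgf import ESGFCatalog

cat = ESGFCatalog()
# redirect the local cache to a shared project space before any data
# is downloaded or loaded (the path is a placeholder for your system)
cat.set_local_cache_directory(Path("/path/to/shared/project/esgf-data"))
```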

## Using data directly

To do.
This package is designed to download data only if it is absolutely necessary. If you are working on a resource with direct access to some portion of the ESGF holdings, you can point to it with `cat.set_esgf_data_root()`. This adds a read-only location that we check for data before downloading. We check a few locations automatically when the package is instantiated. If you would like a location added to our [defaults](https://github.com/esgf2-us/intake-esgf/blob/76fdb8e943f73813160bd76544d5d471c25f2a2d/intake_esgf/base.py#L169), please feel free to submit an [issue](https://github.com/esgf2-us/intake-esgf/issues/new?assignees=&labels=&projects=&template=feature_request.md&title=).
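
For example, on a machine that mounts a portion of the ESGF archive, a sketch might look like the following (the mount point is an assumption for illustration):

```python
from intake_esgf import ESGFCatalog

cat = ESGFCatalog()
# datasets found under this read-only root are used in place rather
# than downloaded (the path is a placeholder for your system)
cat.set_esgf_data_root("/path/to/mounted/esgf/archive")
```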
7 changes: 3 additions & 4 deletions doc/index.rst
@@ -50,13 +50,12 @@ or through ``conda-forge``
   :hidden:
   :caption: Features

   configure
   measures
   dictkeys
   logging
   modelgroups
   operators
   configure
   reproduce
   dictkeys
   logging

.. toctree::
   :maxdepth: 2
79 changes: 79 additions & 0 deletions doc/modelgroups.md
@@ -1 +1,80 @@
---
jupytext:
  text_representation:
    format_name: myst
kernelspec:
  display_name: Python 3
  name: python3
---

# Simplifying Search with Model Groups

At a simple level, you can think of `intake-esgf` as analogous to the ESGF web [interface](https://aims2.llnl.gov/search), but where results are presented to you as a pandas dataframe in place of pages of web results. However, we believe that the user does not want to wade through either of these. Many times you want to see model results organized by unique combinations of `source_id`, `member_id`, and `grid_label`. That is to say, when you perform an analysis, you would like your model outputs to be self-consistent, from the same run and grid, even across experiments. To help you home in on which sets of results may be useful to your analysis, we introduce the notion of *model groups*.

Consider the following search, motivated by a desire to study controls (temperature, precipitation, soil moisture) on the carbon cycle (gross primary productivity) across a number of historical and future scenarios.

```{code-cell}
from intake_esgf import ESGFCatalog
cat = ESGFCatalog().search(
    experiment_id=["historical", "ssp585", "ssp370", "ssp245", "ssp126"],
    variable_id=["gpp", "tas", "pr", "mrso"],
    table_id=["Amon", "Lmon"],
)
print(cat)
```

Even if this exact application does not resonate with you, the situation is a familiar one: we have several thousand results, spanning many different models and variants, to sort through. To help guide you to which groups of models might be useful to you, we provide the following function.

```{code-cell}
cat.model_groups()
```

This returns a pandas series where the results have been grouped and sorted by `source_id`, `member_id`, and `grid_label`, together with the number of datasets in each group. Pandas will probably truncate this series. If you want to see the whole series, you can call `print(cat.model_groups().to_string())` instead. However, as there are still several hundred possible model groups, we will not show that here.

## Removing Incomplete Groups

If you glance through the model groups, you will see that, relative to our search, many are *incomplete*. By this we mean that many model groups do not have all the variables in all the experiments that we wish to include in our analysis. Since we are looking for 5 experiments and 4 variables, we need the model groups with 5 × 4 = 20 dataset results. We can check which groups satisfy this condition by operating on the model group pandas series.

```{code-cell}
mgs = cat.model_groups()
print(mgs[mgs==20])
```

The rest are incomplete, and we would like a fast way to remove them from the search results. In reality, though, our *completeness* criterion is often more complicated than a single number. In the above example, we may want all the variables for all the experiments, but if a model does not have a submission for, say, `ssp126`, that is acceptable.

`intake-esgf` provides an interface that uses a user-provided function to remove incomplete entries. Internally, we loop over all model groups in the results and pass your function the portion of the dataframe that corresponds to the current model group. Your function then returns a boolean, based on the contents of that sub-dataframe, indicating whether the group should be kept.

```{code-cell}
def should_i_keep_it(sub_df):
    # this model group has all experiments/variables
    if len(sub_df) == 20:
        return True
    # if any of these required experiments is missing a variable, remove the group
    for exp in ["historical", "ssp585", "ssp370", "ssp245"]:
        if len(sub_df[sub_df["experiment_id"] == exp]) != 4:
            return False
    # if the check makes it here, only `ssp126` is incomplete and we keep the group
    return True
```

Then we pass this function to the catalog via the `remove_incomplete()` function and observe how it affects the search results.

```{code-cell}
cat.remove_incomplete(should_i_keep_it)
print(cat.model_groups())
```

## Removing Ensembles

Depending on the goals and scope of your analysis, you may want to use only a single variant per model. This can be challenging to locate manually, as not every variant includes every experiment and variable. However, now that we have removed the incomplete results, we can call the `remove_ensembles()` function, which keeps only the *smallest* `member_id` for each model group. By smallest, we mean the first entry after a hierarchical sort using the integer index values of each label in the `member_id`, as sketched after the cell below.

```{code-cell}
cat.remove_ensembles()
print(cat.model_groups())
```
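
To make *smallest* concrete, here is a sketch of such a hierarchical sort (an illustration of the idea, not the package's internal implementation):

```python
import re

def variant_key(member_id: str) -> tuple:
    # split a label like "r10i1p2f1" into its integer components: (10, 1, 2, 1)
    return tuple(int(n) for n in re.findall(r"\d+", member_id))

# the integers are compared numerically, so r2i1p1f1 sorts before r10i1p1f1,
# unlike a plain string sort
print(sorted(["r10i1p1f1", "r2i1p1f1", "r1i1p2f1"], key=variant_key))
```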

Now the results are much more manageable and ready to be downloaded for use in your analysis.

## Feedback

What do you [think](https://github.com/esgf2-us/intake-esgf/issues/new?assignees=&labels=&projects=&template=feature_request.md&title=) of this interface? We have found that it saves our students days of work, but we are interested in critical feedback. Can you think of a simpler interface? Are there other analysis tasks that are painful and time-consuming that we could automate?
64 changes: 64 additions & 0 deletions doc/reproduce.md
@@ -1 +1,65 @@
---
jupytext:
  text_representation:
    format_name: myst
kernelspec:
  display_name: Python 3
  name: python3
---

```{code-cell}
---
tags: [remove-cell]
---
from intake_esgf import ESGFCatalog
```

# Reproducibility

If you are using ESGF data in an analysis you intend to publish, the journal to
which you are submitting may require data citations or a data availability
statement. While we work on improving support for this in ESGF, we also want to
highlight the current functionality. Consider the following query, assumed to
be part of some unspecified analysis. For comparison, we will print the
underlying dataframe to show the results of the search.

```{code-cell}
cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id=["gpp", "tas", "nbp"],
    variant_label=["r1i1p1f1"],
    frequency="mon",
)
cat.df
```

In the course of the analysis, you would download the datasets into a dictionary.

```{code-cell}
dsd = cat.to_dataset_dict(add_measures=False)
```

Then you may loop through the datasets and pull out the `tracking_id` from the
global attributes of each dataset.

```{code-cell}
tracking_ids = [ds.tracking_id for _, ds in dsd.items()]
for tracking_id in tracking_ids:
    print(tracking_id)
```

The `tracking_id` is similar to a digital object identifier (DOI) and can be
provided in some form in your paper or supplemental material to be precise
about which ESGF data you used. If you have a list of `tracking_id`s, you can
pass them into `from_tracking_ids()` to reproduce the catalog.

```{code-cell}
new_cat = ESGFCatalog().from_tracking_ids(tracking_ids)
new_cat.df
```

If you visually compare `cat` with `new_cat`, you will see that they are the
same. From here you may interact with the new catalog and recover the data you
used if needed. This can also be used to communicate to colleagues exactly
which data should be used.
