
add read-in functionality for deeply nested variables #281

Merged: 13 commits into development on Mar 17, 2022
Conversation

JessicaS11
Member

@JessicaS11 commented Feb 25, 2022

When reading in some deeper-nested variables (here canopy/h_canopy) in ATL08, xarray does not read in the coordinates for the data, which sit at the canopy level. This caused merge issues when trying to build a Dataset of all wanted variables. This PR adds functionality that handles these more deeply nested variables by assigning them the proper coordinate values. It also fixes a few typos and an index error that occurred during data read-in.
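
For illustration, a minimal sketch of the idea (not the actual implementation in this PR; the file name and beam group below are placeholders): read the nested variable and attach the coordinates stored one group up, so it can be merged with the other wanted variables.

# hypothetical example file and beam; in ATL08, h_canopy lives under
# gt1l/land_segments/canopy while its coordinates live in gt1l/land_segments
import h5py
import xarray as xr

with h5py.File("ATL08_example.h5", "r") as f:
    seg = f["gt1l/land_segments"]
    h_canopy = xr.DataArray(
        seg["canopy/h_canopy"][:],
        dims=["delta_time"],
        coords={
            "delta_time": seg["delta_time"][:],
            "latitude": ("delta_time", seg["latitude"][:]),
            "longitude": ("delta_time", seg["longitude"][:]),
        },
        name="h_canopy",
    )

ds = xr.Dataset({"h_canopy": h_canopy})  # now merges cleanly with other variables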

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@JessicaS11 linked an issue Feb 25, 2022 that may be closed by this pull request
@github-actions

github-actions bot commented Feb 25, 2022

Binder 👈 Launch a binder notebook on this branch for commit 763305b

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit 1e8b785

Binder 👈 Launch a binder notebook on this branch for commit c385251

Binder 👈 Launch a binder notebook on this branch for commit 39a04ce

Binder 👈 Launch a binder notebook on this branch for commit c3287b8

Binder 👈 Launch a binder notebook on this branch for commit 97a70b2

@codecov-commenter

codecov-commenter commented Feb 25, 2022

Codecov Report

Merging #281 (97a70b2) into development (aa61e3f) will decrease coverage by 0.41%.
The diff coverage is 11.53%.

@@               Coverage Diff               @@
##           development     #281      +/-   ##
===============================================
- Coverage        55.47%   55.06%   -0.42%     
===============================================
  Files               31       31              
  Lines             2017     2034      +17     
  Branches           413      420       +7     
===============================================
+ Hits              1119     1120       +1     
- Misses             828      844      +16     
  Partials            70       70              
Impacted Files             Coverage Δ
icepyx/core/is2ref.py      23.52% <ø> (ø)
icepyx/core/query.py       51.76% <ø> (ø)
icepyx/core/read.py        27.95% <10.00%> (-1.53%) ⬇️
icepyx/core/variables.py   9.70% <16.66%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa61e3f...97a70b2. Read the comment docs.

@weiji14
Member

weiji14 commented Feb 25, 2022

I'm not familiar with ATL08, so I'd recommend @linxiongecu check to make sure things work (perhaps try pip install https://github.com/icesat2py/icepyx/archive/debug.zip or pip install git+https://github.com/icesat2py/icepyx.git@debug to install icepyx from this git branch). That said, it doesn't seem like there are any glaring issues after scanning through the code changes.

P.S. If you're interested in a deep rabbit hole on nested variable support in xarray, have a look at pydata/xarray#4118.

@JessicaS11
Member Author

P.S. If you're interested in a deep rabbit hole on nested variable support in xarray, have a look at pydata/xarray#4118.

I did visit this rabbit hole earlier this week. An interesting read.

@linxiongecu, are you available to review this PR? I can help you through that process if that helps.

@ricardobarroslourenco
Member

Hi everyone, I was able to replicate the error that @linxiongecu mentioned here (#277 (comment)).

For me, when debugging with PyCharm, the kernel is killed like this:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

This seems to be a segfault (according to this StackOverflow topic).

While debugging with breakpoints, the segfault happens when iterating on a for loop, at line 481 of the read.py file. It is an append operation of file URLs (I think) into a list, something quite trivial. I am not sure why, specifically, it gets a segfault (which is often related to a Python binding to another language, such as C++).

Any ideas @JessicaS11? I wonder if this is inheriting an xarray structure and, therefore, such a binding may indeed be happening.

@JessicaS11
Member Author

Thanks for digging into this @ricardobarroslourenco! You got me pointed in the right direction.

While debugging with breakpoints, the segfault happens when iterating on a for loop, at line 481 of the read.py file. It is an append operation of file URLs (I think), into a list, something quite trivial.

Close - the append there is adding an xarray Dataset per file to a list of Datasets.

After some painful line-by-line debugging, I found that the line causing the seg fault is line 369 in read.py. I grabbed the value of is2ds.data_start_utc at that point and was able to reproduce the seg fault with:

import numpy as np
start_date = np.array([b'2021-12-22T00:13:10.417945Z'], dtype='|S27')
start_date.astype(np.datetime64)

I suspect it's related to the deprecation warning here.

I wanted to share this update before signing off for the day. If anyone has code for a fix already written, feel free to add it to the PR, otherwise I will try to get something in tomorrow.
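
For reference, a minimal sketch of one possible workaround (stripping the trailing "Z" before the datetime64 cast; not necessarily the fix that will land in this PR):

import numpy as np

start_date = np.array([b'2021-12-22T00:13:10.417945Z'], dtype='|S27')
# NumPy deprecated parsing timezone-aware ISO strings, so drop the trailing "Z"
cleaned = np.char.replace(start_date.astype(str), "Z", "")
print(cleaned.astype(np.datetime64))  # ['2021-12-22T00:13:10.417945']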

@ricardobarroslourenco
Member

Happy to help @JessicaS11! I just saw your recent commit with the fix, and I am going over it this afternoon.

@JessicaS11
Member Author

The fix should be in this branch. @linxiongecu @ricardobarroslourenco this PR is ready for review!

@ricardobarroslourenco
Member

Thanks @JessicaS11! I will get the updated debug branch and see how it goes :)

@ricardobarroslourenco (Member) left a comment

@JessicaS11 I have reviewed all *.py changes, and also the notebooks, and it seems that:

  • They both address the ATL08 issues discussed in this PR (I also downloaded it locally, and it worked on my machine);
  • The issues that @weiji14 mentioned for ATL06 (refer to "manually remove 'z' from datetime string", #294 (review)) are already solved, as he mentioned;
  • The documentation and notebook changes also seem appropriate since there aren't major class/method changes, and they complement the examples provided.

Therefore, I will approve this PR.

@ricardobarroslourenco
Member

@JessicaS11, I have just approved this PR. Does this automatically trigger the rebuild of the library via Travis and the versioning to conda-forge?

@JessicaS11
Member Author

@ricardobarroslourenco Thanks! Once this PR is merged into development, it will hang out there until the next release (which I'll be putting together this afternoon). After the development branch is merged into main and the release is tagged, that will trigger the builds and posting to PyPI and conda.

@ricardobarroslourenco
Member

Thanks! In the meantime, I am using the fix directly from this PR.

@JessicaS11 merged commit c525715 into development on Mar 17, 2022
@JessicaS11 deleted the debug branch on March 17, 2022 at 18:48
@liuh886

liuh886 commented Sep 26, 2022

Hi, I got an index error on ATL08 similar to the cases here:

icepyx version 0.6.3. Here is a minimal reproduction:

>>> region_a = ipx.Query('ATL08', any_spatial_extent, any_date_range)
>>> region_a.order_vars.wanted
>>> region_a.order_vars.append(keyword_list=['land', 'land_segments','terrain'])
>>> region_a.order_granules(Coverage=region_a.order_vars.wanted)
>>> region_a.download_granules(path)
>>> pattern = "processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
>>> reader = ipx.Read(data_source=path, product="ATL08", filename_pattern=pattern)
You have 1 files matching the filename pattern to be read in.

>>> reader.vars.wanted
>>> reader.vars.append(var_list=['h_te_best_fit'])
>>> ds = reader.load()

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [191], in <cell line: 1>()
----> 1 ds = reader.load()

File ~\.conda\envs\snowdepth\lib\site-packages\icepyx\core\read.py:542, in Read.load(self)
    535 # DevNote: I'd originally hoped to rely on intake-xarray in order to not have to iterate through the files myself,
    536 # by providing a generalized url/source in building the catalog.
    537 # However, this led to errors when I tried to combine two identical datasets because the single dimension was equal.
    538 # In these situations, xarray recommends manually controlling the merge/concat process yourself.
    539 # While unlikely to be a broad issue, I've heard of multiple matching timestamps causing issues for combining multiple IS2 datasets.
    540 for file in self._filelist:
    541     all_dss.append(
--> 542         self._build_single_file_dataset(file, groups_list)
    543     )  # wanted_groups, vgrp.keys()))
    545 if len(all_dss) == 1:
    546     return all_dss[0]

File ~\.conda\envs\snowdepth\lib\site-packages\icepyx\core\read.py:682, in Read._build_single_file_dataset(self, file, groups_list)
    680 grp_path = wanted_groups_list[0]
    681 wanted_groups_list = wanted_groups_list[1:]
--> 682 ds = self._read_single_grp(file, grp_path)
    683 is2ds, ds = Read._add_vars_to_ds(
    684     is2ds, ds, grp_path, wanted_groups_tiered, wanted_dict
    685 )
    687 # if there are any deeper nested variables, get those so they have actual coordinates and add them

File ~\.conda\envs\snowdepth\lib\site-packages\icepyx\core\read.py:602, in Read._read_single_grp(self, file, grp_path)
    598 try:
    599     grpcat = is2cat.build_catalog(
    600         file, self._pattern, self._source_type, grp_paths=grp_path
    601     )
--> 602     ds = grpcat[self._source_type].read()
    604 # NOTE: could also do this with h5py, but then would have to read in each variable in the group separately
    605 except ValueError:

File ~\.conda\envs\snowdepth\lib\site-packages\intake_xarray\base.py:39, in DataSourceMixin.read(self)
     37 def read(self):
     38     """Return a version of the xarray with all the data in memory"""
---> 39     self._load_metadata()
     40     return self._ds.load()

File ~\.conda\envs\snowdepth\lib\site-packages\intake\source\base.py:236, in DataSourceBase._load_metadata(self)
    234 """load metadata only if needed"""
    235 if self._schema is None:
--> 236     self._schema = self._get_schema()
    237     self.dtype = self._schema.dtype
    238     self.shape = self._schema.shape

File ~\.conda\envs\snowdepth\lib\site-packages\intake_xarray\base.py:18, in DataSourceMixin._get_schema(self)
     15 self.urlpath = self._get_cache(self.urlpath)[0]
     17 if self._ds is None:
---> 18     self._open_dataset()
     20     metadata = {
     21         'dims': dict(self._ds.dims),
     22         'data_vars': {k: list(self._ds[k].coords)
     23                       for k in self._ds.data_vars.keys()},
     24         'coords': tuple(self._ds.coords.keys()),
     25     }
     26     if getattr(self, 'on_server', False):

File ~\.conda\envs\snowdepth\lib\site-packages\intake_xarray\netcdf.py:92, in NetCDFSource._open_dataset(self)
     88 else:
     89     # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
     90     url = fsspec.open(self.urlpath, **self.storage_options).open()
---> 92 self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\api.py:531, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    519 decoders = _resolve_decoders_kwargs(
    520     decode_cf,
    521     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    527     decode_coords=decode_coords,
    528 )
    530 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 531 backend_ds = backend.open_dataset(
    532     filename_or_obj,
    533     drop_variables=drop_variables,
    534     **decoders,
    535     **kwargs,
    536 )
    537 ds = _dataset_from_backend_dataset(
    538     backend_ds,
    539     filename_or_obj,
   (...)
    547     **kwargs,
    548 )
    549 return ds

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\h5netcdf_.py:401, in H5netcdfBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
    389 store = H5NetCDFStore.open(
    390     filename_or_obj,
    391     format=format,
   (...)
    396     decode_vlen_strings=decode_vlen_strings,
    397 )
    399 store_entrypoint = StoreBackendEntrypoint()
--> 401 ds = store_entrypoint.open_dataset(
    402     store,
    403     mask_and_scale=mask_and_scale,
    404     decode_times=decode_times,
    405     concat_characters=concat_characters,
    406     decode_coords=decode_coords,
    407     drop_variables=drop_variables,
    408     use_cftime=use_cftime,
    409     decode_timedelta=decode_timedelta,
    410 )
    411 return ds

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\store.py:26, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     14 def open_dataset(
     15     self,
     16     store,
   (...)
     24     decode_timedelta=None,
     25 ):
---> 26     vars, attrs = store.load()
     27     encoding = store.get_encoding()
     29     vars, attrs, coord_names = conventions.decode_cf_variables(
     30         vars,
     31         attrs,
   (...)
     38         decode_timedelta=decode_timedelta,
     39     )

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\common.py:125, in AbstractDataStore.load(self)
    103 def load(self):
    104     """
    105     This loads the variables and attributes simultaneously.
    106     A centralized loading function makes it easier to create
   (...)
    122     are requested, so care should be taken to make sure its fast.
    123     """
    124     variables = FrozenDict(
--> 125         (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    126     )
    127     attributes = FrozenDict(self.get_attrs())
    128     return variables, attributes

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\h5netcdf_.py:232, in H5NetCDFStore.get_variables(self)
    231 def get_variables(self):
--> 232     return FrozenDict(
    233         (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    234     )

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\core\utils.py:474, in FrozenDict(*args, **kwargs)
    473 def FrozenDict(*args, **kwargs) -> Frozen:
--> 474     return Frozen(dict(*args, **kwargs))

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\h5netcdf_.py:233, in <genexpr>(.0)
    231 def get_variables(self):
    232     return FrozenDict(
--> 233         (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    234     )

File ~\.conda\envs\snowdepth\lib\site-packages\xarray\backends\h5netcdf_.py:197, in H5NetCDFStore.open_store_variable(self, name, var)
    194 def open_store_variable(self, name, var):
    195     import h5py
--> 197     dimensions = var.dimensions
    198     data = indexing.LazilyIndexedArray(H5NetCDFArrayWrapper(name, self))
    199     attrs = _read_attributes(var)

File ~\.conda\envs\snowdepth\lib\site-packages\h5netcdf\core.py:252, in BaseVariable.dimensions(self)
    250 """Return variable dimension names."""
    251 if self._dimensions is None:
--> 252     self._dimensions = self._lookup_dimensions()
    253 return self._dimensions

File ~\.conda\envs\snowdepth\lib\site-packages\h5netcdf\core.py:148, in BaseVariable._lookup_dimensions(self)
    145 # normal variable carrying DIMENSION_LIST
    146 # extract hdf5 file references and get objects name
    147 if "DIMENSION_LIST" in attrs:
--> 148     return tuple(
    149         self._root._h5file[ref[0]].name.split("/")[-1]
    150         for ref in list(self._h5ds.attrs.get("DIMENSION_LIST", []))
    151     )
    153 # need to use the h5ds name here to distinguish from collision dimensions
    154 child_name = self._h5ds.name.split("/")[-1]

File ~\.conda\envs\snowdepth\lib\site-packages\h5netcdf\core.py:149, in <genexpr>(.0)
    145 # normal variable carrying DIMENSION_LIST
    146 # extract hdf5 file references and get objects name
    147 if "DIMENSION_LIST" in attrs:
    148     return tuple(
--> 149         self._root._h5file[ref[0]].name.split("/")[-1]
    150         for ref in list(self._h5ds.attrs.get("DIMENSION_LIST", []))
    151     )
    153 # need to use the h5ds name here to distinguish from collision dimensions
    154 child_name = self._h5ds.name.split("/")[-1]

IndexError: index 0 is out of bounds for axis 0 with size 0

If this is not the same bug, I can post a new issue. :-)

Successfully merging this pull request may close this issue: IndexError: list index out of range