
Out of bounds data indexing behaviour #291

Open
sadielbartholomew opened this issue Mar 6, 2024 · 2 comments
Labels
question Further information is requested

Comments

@sadielbartholomew
Member

sadielbartholomew commented Mar 6, 2024

Here in cfdm, and downstream in cf-python, when (mistakenly) attempting to access data that is out of bounds for the given construct, different behaviour results depending on whether we attempt the __getitem__ on the construct itself or on the underlying Data accessed e.g. via <construct>.data. See the example below. My questions are:

  • is this what we expect? (my thoughts: not really, consistency would be nice)
  • is this a facet of Dask under-the-hood and its laziness that we can't change? (may be so, but I am not sure?)
  • for the case where we error, isn't the error message a bit cryptic relative to something more direct and clear? (my feeling is we should improve the error to make it more akin to an IndexError with error message index X is out of bounds for axis N with size Y as returned by numpy for the same case, e.g. np.array(range(36))[100])

Example (using Python 3.12 and the latest version of cf*)

>>> t = cfdm.example_field(2).dimension_coordinate('time')
>>> t
<DimensionCoordinate: time(36) days since 1959-01-01 >
>>> # Attempt to get item that is out of bounds directly: it errors
>>> t[100]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slb93/git-repos/cfdm/cfdm/mixin/propertiesdatabounds.py", line 170, in __getitem__
    new = super().__getitem__(indices)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/slb93/git-repos/cfdm/cfdm/mixin/propertiesdata.py", line 75, in __getitem__
    raise IndexError(
IndexError: Indices (100,) result in a subspaced shape of (0,), but can't create a subspace of DimensionCoordinate that has a size 0 axis
>>> 
>>> # But attempt to access it via the underlying Data, it returns 0 Data
>>> t.data[100]
<Data(0): >
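A bounds check along the lines suggested in the third question above could look something like the following. Note this is only a sketch of the proposed numpy-style error message; `check_in_bounds` is a hypothetical helper, not part of cfdm:

```python
def check_in_bounds(indices, shape):
    """Raise a numpy-style IndexError for out-of-bounds integer indices.

    Hypothetical helper illustrating the error message suggested above;
    not part of cfdm.
    """
    for axis, (index, size) in enumerate(zip(indices, shape)):
        # Negative indices down to -size are valid, as in numpy
        if isinstance(index, int) and not -size <= index < size:
            raise IndexError(
                f"index {index} is out of bounds for axis {axis} "
                f"with size {size}"
            )


check_in_bounds((3,), (36,))  # within bounds: no error
try:
    check_in_bounds((100,), (36,))
except IndexError as e:
    print(e)  # index 100 is out of bounds for axis 0 with size 36
```

Such a check would have to run before any integer indices are converted to slices (see the discussion below), since the conversion is what loses the out-of-bounds information.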
@sadielbartholomew sadielbartholomew added the question Further information is requested label Mar 6, 2024
@davidhassell
Contributor

Hi Sadie, this is interesting!

  1. is this what we expect? (my thoughts: not really, consistency would be nice)
  • Yes. A pure cf.Data object should behave much like a Dask or numpy array, which can have zero size. However, we do have two deviations from this: never dropping dimensions, and orthogonal indexing. Constructs, however, have constraints, such that they can't be scalar nor have zero size, so we need to trap this on them.
  2. is this a facet of Dask under-the-hood and its laziness that we can't change? (may be so, but I am not sure?)
  • No
  3. for the case where we error, isn't the error message a bit cryptic relative to something more direct and clear? (my feeling is we should improve the error to make it more akin to an IndexError with error message index X is out of bounds for axis N with size Y as returned by numpy for the same case, e.g. np.array(range(36))[100])
  • Tricky. Consider:
  • Tricky. Consider:
>>> a = t.array
>>> a[100]  # Drops axis and fails
IndexError: index 100 is out of bounds for axis 0 with size 36
>>> a[slice(100,101)]  # Retains axis and returns with zero size
array([], dtype=float64)
>>> a[slice(2, 2)] 
array([], dtype=float64)

This is what is going on with cf.Data - it converts 100 to slice(100, 101) prior to applying it to the (numpy or Dask) array. This is because we are presuming to keep axes (in cf-python this is tweakable).

So, we don't really know why the array was zero size - was it out of bounds or something else?

In cf-python we can do

>>> t.data[100]
<CF Data(0): >
>>> t.data.__keepdims_indexing__ = False
>>> t.data[100]
IndexError: Index 100 is out of bounds for axis 0 with size 36

because t.data.__keepdims_indexing__ = False prevents 100 from being converted to a size 1 slice.
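The effect of that toggle can be sketched in plain numpy. This is only an illustration of the integer-to-slice conversion described above (`subspace` is a hypothetical function; cf.Data's actual indexing logic is more involved):

```python
import numpy as np


def subspace(array, index, keepdims=True):
    """Sketch of keepdims-style indexing.

    With keepdims=True, an integer index is converted to a size-1 slice
    before applying, so an out-of-bounds index silently yields a
    zero-size result. With keepdims=False, the integer is passed through
    unchanged and numpy raises its usual IndexError.
    Illustrative only; not cf.Data's actual implementation.
    """
    if keepdims and isinstance(index, int):
        # -1 needs slice(-1, None); slice(-1, 0) would be empty
        index = slice(index, index + 1) if index != -1 else slice(-1, None)
    return array[index]


a = np.arange(36.0)
print(subspace(a, 100))  # zero-size array, no error
try:
    subspace(a, 100, keepdims=False)
except IndexError as e:
    print(e)  # index 100 is out of bounds for axis 0 with size 36
```

Once the integer has become a slice, numpy (and Dask) can no longer distinguish "out of bounds" from a legitimately empty selection like a[slice(2, 2)], which is why the resulting error message is generic.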

Does any of that make sense?

@sadielbartholomew
Member Author

Hi @davidhassell, thanks so much for your detailed explanations here. I think it all makes sense to me (and if it doesn't, that is because I am still trying to wrap my head around it, rather than any fault of your comment, which is detailed and well explained), though I might ask some follow-up questions in our catch-up this afternoon.
