Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We should control the type of ak.Array.attrs, and it should be immutable #3277

Open
jpivarski opened this issue Oct 8, 2024 · 0 comments
Open
Labels
policy Choice of behavior

Comments

@jpivarski
Copy link
Member

Right now,we just take the attrs that a user gives us and pass it around, allowing it to be mutated arbitrarily.

>>> import awkward as ak
>>> class MyDict(dict): pass
... 
>>> attrs = MyDict(one=1)

>>> array1 = ak.Array([1, 2, 3], attrs=attrs)
>>> type(array1.attrs)
<class '__main__.MyDict'>
>>> array1.attrs["two"] = 2

>>> array2 = array1 * 1.1
>>> type(array2.attrs)
<class '__main__.MyDict'>
>>> array2.attrs["three"] = 3

>>> array2.attrs
{'one': 1, 'two': 2, 'three': 3}
>>> array1.attrs
{'one': 1, 'two': 2, 'three': 3}
>>> attrs
{'one': 1, 'two': 2, 'three': 3}

This is not how Awkward Array things usually work, usually we

  • make a copy to unlink "our" version of the data from the user-provided collection
  • regularize the type to simplify our internal code by being able to make assumptions about its type
  • prefer to limit mutability; the ak.Array object is mutable, but so far, none of the attributes hanging off of the ak.Array objects have been mutable

This is all just a matter of expectations and different users will have different expectations, so this issue is labeled "policy" and we can brainstorm in the comments. However, I think the current state of affairs becomes a problem when people serialize/deserialize through Arrow/Parquet, since serialization

  • copies data, unlinking it from the original
  • can't have unrecognized types like MyDict
  • is essentially immutable

Here's what I think it should do instead. I think that, given an arbitrary attrs Mapping, we should copy it into a frozendict-like type that we control, pass that frozendict from arrays to derived arrays, and replace it when modified by __setitem__, similar to the way that ak.Array.__setitem__ replaces its layout, rather than changing its layout in-place. (Because layout nodes are highly shared among arrays; the ak.Array outer shell is not.)

Our attrs type would be something like

class Attrs:
    def __init__(self, parent: ak.Array, data: Mapping[str, Any]):
        self._parent = parent
        self._data = dict(data)

    def __getitem__(self, key: str):
        return self._data[key]

    def __setitem__(self, key: str, value: Any):
        self._parent._attrs = Attrs(self._parent, self._data | {key: value})

The effect of this is that it would still allow

>>> array1.attrs["one"] = 1
>>> array1.attrs
{"one": 1}

and attrs would be passed to derived arrays

>>> array2 = array1 * 1.1
>>> array2.attrs
{"one": 1}

but influence would not be propagated backward:

>>> array2.attrs["two"] = 2
>>> array2.attrs
{"one": 1, "two": 2}
>>> array1.attrs
{"one": 1}

This is what I think Awkward Arrays should do, but it's open for discussion as to whether it's too weird for Python. It fits in better with what serialized data does: if you save an array to a file and open it in another process and start making changes, it absolutely won't affect the original array—this ensures the same semantics for derived arrays regardless of whether they're saved to files.

Side-note: parameters should do this, too. This issue would be resolved by ensuring that attrs and parameters both have this semantics, or by ensuring that parameters are strictly immutable.

>>> layout1 = ak.contents.NumpyArray(np.array([1, 2, 3]), parameters={"one": 1})
>>> layout1.parameters
{'one': 1}
>>> layout2 = layout1[[2, 0, 0, 1]]
>>> layout2.parameters["two"] = 2
>>> layout2.parameters
{'one': 1, 'two': 2}
>>> layout1.parameters
{'one': 1, 'two': 2}

A note on performance: frozendicts don't need to be copied deeply. They can be passed around by reference. If the Attrs class is implemented with a reference to a parent, then this is the only thing that needs to change when deriving one array from another.

A note on memory management: the way that I described it above, with a reference to parent, introduces a cyclic reference (because the ak.Array has to have a reference to the Attrs). I think this cycle can be broken with two types; one is a short-lived proxy.

A note on deprecation: this would change semantics in two things that have already been "in the wild," both attrs and parameters. But there's a natural place to put the deprecation warning—in the __setitem__ call. This can go through the normal two-version deprecation cycle.

If this seems to be too big of a change, here's a more conservative suggestion: keep the value/reference semantics that we have now, but at least ensure that the internal representation is a dict. We could refuse to accept a general Mapping or MutableMapping and only accept simple dicts, since we know we'll be able to reconstitute such a thing when deserializing from a file. (This is the opposite of what I told @pfackeldey in #3238.)

@jpivarski jpivarski added the policy Choice of behavior label Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
policy Choice of behavior
Projects
None yet
Development

No branches or pull requests

1 participant