Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Children proxy interface #9477

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from
77 changes: 74 additions & 3 deletions xarray/core/treenode.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@
from pathlib import PurePosixPath
from typing import (
TYPE_CHECKING,
Any,
Generic,
TypeVar,
)

from xarray.core.utils import Frozen, is_dict_like
from xarray.core.utils import is_dict_like

if TYPE_CHECKING:
from xarray.core.types import T_DataArray
Expand Down Expand Up @@ -44,6 +45,76 @@ def __init__(self, *pathsegments):
Tree = TypeVar("Tree", bound="TreeNode")


class Children(Mapping[str, Tree], Generic[Tree]):
"""
Dictionary-like container for the immediate children of a single DataTree node.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this class should allow path-like access to grandchildren or only allow access to immediate children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question also for setting

Copy link
Contributor Author

@TomNicholas TomNicholas Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts I think this class should only allow getting/setting immediate children (because it's basically a glorified dict).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts I think this class should only allow getting/setting immediate children (because it's basically a glorified dict).

Agreed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to #9485 though - why exactly do we want to support full paths for .coords but only local access for .children? We could do whatever, but what's the rationale for them being different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're basically thinking of .coords/.children as meaning "access to all coords/children defined anywhere on the subtree". I originally thought of them as "access to just coords/children defined on this node".

FYI path-like access like this means that list(dt.children) will return a subset of what can be accessed via .__getitem__, (and __contains__ could go either way, see #9354)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One compromise option of sorts (similar to what is done for Dataset for derived variables like time.year, or for coordinates in Dataset.__getitem__) is to only list immediate children in __iter__ and __contains__ but to support accessing them in __getitem__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a detailed discussion about this on the old repo (xarray-contrib/datatree#240 (comment)). There the conclusion was that __contains__ should support paths if __getitem__ does, but it's okay for .keys()/.values()/.items() to be local only.

So you're basically thinking of .coords/.children as meaning "access to all coords/children defined anywhere on the subtree".

To play devil's advocate a bit: If .coords can access the whole tree then why not also .variables? I think they should all be consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a detailed discussion about this on the old repo (xarray-contrib/datatree#240 (comment)). There the conclusion was that __contains__ should support paths if __getitem__ does, but it's okay for .keys()/.values()/.items() to be local only.

Yes, that seems to be how Dataset works for coordinates.

To play devil's advocate a bit: If .coords can access the whole tree then why not also .variables? I think they should all be consistent.

Sure, we could do that. Hopefully it's all very straightforwardly reusing the same machinery?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems to be how Dataset works for coordinates.

I'm not sure what you mean. You're talking about virtual vs non-virtual coordinates on Dataset? That they are all in __contains__ and __getitem__, but only a subset (the non-virtual ones) are listed by .keys()?

If .coords can access the whole tree then why not also .variables?

Sure, we could do that.

I actually meant that as a example of counter-intuitive behaviour 😅 . But maybe it's fine too? That would mean we have really done a 180 degree turn in this issue... It would imply changing the current behaviour of dt.data_vars too.

Hopefully it's all very straightforwardly reusing the same machinery?

It's not right now (local vs path-supporting have different codepaths), but it could be made to: everything could go through some version of using TreeNode._set_item

def _set_item(

or ._walk_to and we could add a parameter like traverse_paths to generalize it for both cases. Generalizing .update to support paths might be the approach that is easiest to integrate with the Coordinates base class.

Also one final thing: We are talking here about supporting paths to descendant nodes only? Or upwards via dt['../../path/to/node/'] too?


This collection can be passed directly to the :py:class:`~xarray.DataTree` constructor via its `children` argument.
"""

_treenode: Tree

# TODO add slots?
# __slots__ = ("_data",)

def __init__(self, treenode: Tree):
self._treenode = treenode

@property
def _names(self) -> list[str]:
return list(self._treenode._children.keys())
Comment on lines +64 to +65
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly use the set-like DictKeys object?

Suggested change
def _names(self) -> list[str]:
return list(self._treenode._children.keys())
def _names(self) -> Set[str]:
return self._treenode._children.keys()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably use .keys() you're right, but I don't think I should type this as returning Set, because I originally tried return set(self._treenode._children) and that failed intermittently due to ordering differences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of collections.abc.Set, which is an abstract base class that includes DictKeys, not the specific built-in set class


def __iter__(self) -> Iterator[str]:
return iter(self._names)

def __len__(self) -> int:
return len(self._names)

def __contains__(self, key: str) -> bool:
return key in self._names

def __repr__(self) -> str:
# TODO
from xarray.core import formatting

return formatting.children_repr(self)

def __getitem__(self, key: str) -> Tree:
return self._treenode._children[key]

def __delitem__(self, key: str) -> None:
if key in self._names:
child = self._treenode._children[key]
del self._treenode._children[key]
child.orphan()
else:
raise KeyError(key)

def __setitem__(self, key: str, value: Any) -> None:
self.update({key: value})

def update(self, other: Mapping[str, Tree]) -> None:
"""Update with other child nodes."""

# TODO forbid strings with `/` in here?

if not len(other):
return
Comment on lines +98 to +99
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly more canonical Python to avoid the call to len():

Suggested change
if not len(other):
return
if not other:
return


children = self._treenode._children.copy()
children.update(other)
self._treenode.children = children
Comment on lines +101 to +103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to verify, this is a no-op if children have incompatible indexes, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .children setter defined on DataTree (and TreeNode) will check for name collisions between variables and children, and if it fails it will avoid modification in-place (we can thank the anytree library for that bit of code 1). In the same place then check_alignment also gets called. Is that what you meant?

Footnotes

  1. Should we adjust the licensing part of the readme as we're bundling small portions of anytree code? We did already add it to the licenses directory here.


def _ipython_key_completions_(self):
"""Provide method for the key-autocompletions in IPython."""
# TODO
return [
key
for key in self._treenode._ipython_key_completions_()
if key not in self._treenode.variables
]


class TreeNode(Generic[Tree]):
"""
Base class representing a node of a tree, with methods for traversing and altering the tree.
Expand Down Expand Up @@ -160,9 +231,9 @@ def orphan(self) -> None:
self._set_parent(new_parent=None)

@property
def children(self: Tree) -> Mapping[str, Tree]:
def children(self: Tree) -> Children[str, Tree]:
"""Child nodes of this node, stored under a mapping via their names."""
return Frozen(self._children)
return Children(self)

@children.setter
def children(self: Tree, children: Mapping[str, Tree]) -> None:
Expand Down
58 changes: 57 additions & 1 deletion xarray/tests/test_treenode.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,13 @@
import pytest

from xarray.core.iterators import LevelOrderIter
from xarray.core.treenode import InvalidTreeError, NamedNode, NodePath, TreeNode
from xarray.core.treenode import (
Children,
InvalidTreeError,
NamedNode,
NodePath,
TreeNode,
)


class TestFamilyTree:
Expand Down Expand Up @@ -224,6 +230,56 @@ def test_overwrite_child(self):
assert marys_evil_twin.parent is john


class TestChildren:
Copy link
Contributor Author

@TomNicholas TomNicholas Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit awkward to know whether to put tests for this class in test_treenode.py or test_datatree.py.

On the one hand the concept of access to .children is very much part of the TreeNode structure - it has nothing to do with .data. On the other hand it's hard to properly test it without full features that are only available on DataTree, such as .copy, a nice repr, and assertions. It's also a fully public-facing class, unlike TreeNode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot less awkward after #9482.

def test_properties(self):
sue: TreeNode = TreeNode()
mary: TreeNode = TreeNode(children={"Sue": sue})
kate: TreeNode = TreeNode()
john = TreeNode(children={"Mary": mary, "Kate": kate})

children = john.children
assert isinstance(children, Children)

# len
assert len(children) == 2

# iter
assert list(children) == ["Mary", "Kate"]

assert john.children["Mary"] is mary
assert john.children["Kate"] is kate

assert "Mary" in john.children
assert "Kate" in john.children
assert 0 not in john.children
assert "foo" not in john.children

# only immediate children should be accessible
assert "sue" not in john.children

with pytest.raises(KeyError):
children["foo"]
with pytest.raises(KeyError):
children[0]

# TODO not sure what this should look like...
# repr
# expected = dedent(
# """\
# Children:
# * x (x) int64 16B -1 -2
# * y (y) int64 24B 0 1 2
# a (x) int64 16B 4 5
# b int64 8B -10"""
# )
# actual = repr(coords)
# assert expected == actual

def test_modify(self):
# TODO
...


class TestPruning:
def test_del_child(self):
john: TreeNode = TreeNode()
Expand Down
Loading