-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplifying UnionType during concatenation #2182
Comments
I think this would be a useful feature, though probably hidden behind |
Yeah, about this kind of simplification—it's one that has come up before. The option-type and union-type simplifications that we have now (
Merging the types of
as
is both visible at the type level and it loses information. If they're merged as a union, then each value either has fields If we make that change, not only would we break things for current users, but we also wouldn't be able to back out of it easily if we decide that we needed the lost information after all. (We can't change it by calling it a bug-fix: the current behavior really was/is intended.) Perhaps we could add a function that does this union-squashing manually, so that it's an opt-in thing (and AnnData could opt into it on the whole-library level, so that your users don't have to think about it). The hard part, then, is what should this function be named? (The implementation would not be very complicated, through |
Thanks for the clarification. An explicit function to simplify types sounds good! Some naming ideas
|
Perhaps Footnotes
|
awkward/src/awkward/contents/unionarray.py Lines 598 to 630 in cda7688
It would be better to make it apply only to this case, so that capabilities can be pieced together (like the way that The name should include both "union" and "record," since it turns unions of records into records with option-type fields. "Squash" sounds like "flatten," and that refers to list-type dimensions. How about |
I'm trying to wrap my head around what
Especially:
|
Most explicitly, the definition of option is taken from the datashape grammar that we use: https://datashape.readthedocs.io/en/latest/overview.html#option Semantically, yes, an option is well defined by So, TLDR: at the layout level, That said, Jim's example
might seem confusing — it implies that the missing values in ak.contents.UnmaskedArray(ak.contents.EmptyArray()) I hope that I haven't made this any more confusing. Jim might have a keener insight into your question. |
Datashape doesn't have the concept of unions though, right?
This one is a little confusing to me, since you can also have empty typed arrays (though I haven't figured out an easy way to create them). Like: ak.to_packed(ak.Array([[1], [], [], []])[1:])
# <Array [[], [], []] type='3 * var * int64'> But I think the concepts of unknown and empty are a little conflated here. |
I've been playing around a little more. I think this little snippet helped elucidate some things for me (the output is quite long): arrays = [
ak.pad_none(ak.contents.EmptyArray(), 1, 0),
ak.Array([[{"a": 1, "b": "foo"}]]),
ak.Array([[{"a": 2}]]),
]
for perm in permutations(arrays):
try:
res = ak.concatenate(perm)
res.type.show()
print(ak.to_buffers(res)[0])
except:
print("Error") This does not return a consistent type. It returns both:
and
Each of which has a distinct form/ layout. If |
Correct! See here.
Yes, the name would be better as
You example with merging is strongly related to the bug that this issue originally reports. Simply put, strings are implemented as views over character arrays. This view-metadata is being lost with our merging logic. |
You're right, @ivirshup, about how missing values could be implemented; we just made different choices here. Actually, I can think of three different ways to represent missing values that I'll outline below. First, though, it would be useful to distinguish these two types:
(The usefulness of a bottom type is that it is an identity under type-unification. Suppose you have an The Awkward Array doesn't have a unit type. If it did ( The three ways that I'm aware of for implementing missing values are:
(1) has a nice conceptual feature that you can reuse the concept of I just spent a while wondering why Julia, a compiled language, goes with (1), rather than (2), and it's because they're working with non-columnar data structures. As record-oriented structures, struct UnionOfNullAnd{T}
tag::UInt8
null_value::Nothing
other_value::T
end is effectively the same as struct Optional{T}
is_missing::Bool
other_value::T
end since the So if we used representation (1), we'd be forced to use a mask-like The statements I made about
are independent of representation. You'd want these type expressions to be equivalent in a high-level view, although all of the representations, (1), (2), and (3), would have to do some work to make them equivalent. Maybe Julia and other compiled languages would lose the wrapping-structs in some compiler optimization pass, so maybe they don't have to deal with it explicitly.1 We have to, and we do it (now) by imposing a conventional layout when the nodes are being constructed. (We used to do it after the nodes were constructed but before they were returned to the user, but missed some cases.) Footnotes
|
My last message here was addressing @grst's original issue, about I had missed this point:
This is showing non-commutativity of the type that comes from |
Thanks for the explanation @jpivarski, I think this was helpful. Especially seeing that I think I've figured out how we're going to end up working with this on anndata once this feature is out (scverse/anndata#898). Any idea on the timeframe for a release with Another question, are there other parts of the high level API that let me construct unions of records? I think I've only been seeing these from
As a minor point, I don't think this description is quite accurate. Julia does use a "columnar" representation for null values in arrays. From the blog post describing the feature (which I think was quite good): Quote describing how Array{Missing{T}} works
|
Probably early next week, but I can do a patch release (2.0.7) that would include
It would be hard to enumerate them, but there are others. One that comes to mind is ak.fill_none, since the fill value might not match the type of the nullable data: >>> array = ak.Array([1, 2, 3, None, 4, 5, None, 6])
>>> array.show(type=True)
type: 8 * ?int64
[1,
2,
3,
None,
4,
5,
None,
6]
>>> array2 = ak.fill_none(array, "hello")
>>> array2.show(type=True)
type: 8 * union[
int64,
string
]
[1,
2,
3,
'hello',
4,
5,
'hello',
6]
I see! I followed through to the deprecated DataArrays.jl (because "In fact, the layout of the |
This should have been fixed by |
Description of new feature
In #2058, you concluded that it is more useful if
ak.concatenate
simplifies types where possible instead of creating a Union. Here I have another example where I think this would be useful:Current behavior
Like this, I can't even access fields anymore after concatenation:
Desired behavior
Create an option type
FWIW, awkward infers exactly this option type when passing the concatenated array through a python object:
CC @ivirshup
The text was updated successfully, but these errors were encountered: