-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #3177 Index.sort_values #3235
base: master
Are you sure you want to change the base?
Conversation
21af366
to
6f8dab4
Compare
d09dc21
to
02f814c
Compare
9638b14
to
ad6be3a
Compare
1004ddf
to
ffc5d4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the tests terribly closely yet. I left a few comments for you to check out but the most important is I stumbled upon a bug due to using ak_isnan(self.values)
instead of ak_isnan(self.values)[perm]
>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2])
>>> ak.Index(vals).sort_values()
Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')
arkouda/categorical.py
Outdated
if ascending is True: | ||
return sorted_array | ||
else: | ||
return sorted_array[arange(sorted_array.size - 1, -1, -1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to think how sorted_array[arange(sorted_array.size - 1, -1, -1)]
differs from something like sorted_array[::-1]
. I can't think of any benefit to creating the array and the slice avoids having to allocate the memory for the indexing array
I would normally use the slice here, but even that seems not ideal since pdarray indexing is kinda aimed for the general random access into the array case. But in the reverse an array case there's so much structure, you'd think we could do better. One thing I found when I searched our codebase was flip
added by @jeremiah-corrado for array api stuff. The implementation is for arbitrary dimensions and the details are a bit hard for me to parse through, so it's not immediately if this would have better performance but it's something that def worth looking into since reversing an array pops up a lot
python code:
arkouda/arkouda/array_api/manipulation_functions.py
Lines 122 to 132 in 74b5889
def flip(x: Array, /, *, axis: Optional[Union[int, Tuple[int, ...]]] = None) -> Array: | |
""" | |
Reverse an array's values along a particular axis or axes. | |
Parameters | |
---------- | |
x : Array | |
The array to flip | |
axis : int or Tuple[int, ...], optional | |
The axis or axes along which to flip the array. If None, flip the array along all axes. | |
""" |
chpl code:
arkouda/src/ManipulationMsg.chpl
Line 373 in 74b5889
proc flipMsg(cmd: string, msgArgs: borrowed MessageArgs, st: borrowed SymTab, param nd: int): MsgTuple throws { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, that implementation of flip is not very efficient because it has to generalize to any number of dimensions. I'd expect it would give you about the same performance as sorted_array[::-1]
.
A good strategy might be to use flipMsg
for now, where I'll probably add a more efficient 1D path at some point in the future. And then we won't have to do anything for sort_values
to take advantage of those performance improvements when that happens. (This will require uncommenting this line to always build ManipulationMsg — which is a switch we were going to have to flip at some point to support other numpy-like functionality in the main arkouda api).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense! I agree,I think enabling ManipulationMsg
in the server modules config file and using flip here is the way to go
if isinstance(lst, list): | ||
d = dtype(type(lst[0])) | ||
for item in lst: | ||
assert dtype(type(item)) == d, ( | ||
f"Values of Index must all be same type. " | ||
f"Types {d} and {dtype(type(item))} do not match." | ||
) | ||
return d | ||
return np.array(lst).dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pointing out that this used to error on a list of mixed types. I think this a great simplification as long as that's the behavior we're going for
In [4]: def _dtype_of_lst(lst):
...: d = ak.dtype(type(lst[0]))
...: for item in lst:
...: assert ak.dtype(type(item)) == d, (
...: f"Values of Index must all be same type. "
...: f"Types {d} and {ak.dtype(type(item))} do not match."
...: )
...: return d
...:
In [5]: lst = [1, 1.0, 'hi']
In [6]: _dtype_of_lst(lst)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
AssertionError: Values of Index must all be same type. Types int64 and float64 do not match.
In [7]: np.array(lst).dtype
Out[7]: dtype('<U32')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. The only place this is used is to set the dtype of an index that has a list for values. Since a mixed type list can converted to a pdarray (aparently), I can't think of why this would be a problem?
In [7]: l = ["abc", 1, 1.3]
In [8]: np.array(l).dtype
Out[8]: dtype('<U32')
In [9]: ak.array(l)
Out[9]: array(['abc', '1', '1.3'])
arkouda/index.py
Outdated
Return a sorted copy of the index. | ||
|
||
Return a sorted copy of the index, and optionally return the indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we say Return a sorted copy of the index
twice. Idk if that's on purpose or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I combined these to simplify. Thanks for pointing this out.
arkouda/index.py
Outdated
is_nan = np_isnan(self.values) | ||
nan_vals = [i for (i, b) in zip(perm, is_nan) if b] | ||
not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b] | ||
|
||
if na_position == "last": | ||
perm = [*not_nan_vals, *nan_vals] # type: ignore | ||
elif na_position == "first": | ||
perm = [*nan_vals, *not_nan_vals] # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't similar logic for the pdarray case work here? I know these values are relatively small since it's a list so perf doesn't really matter but doing a second list comprehension just to get the negation of the first seems kinda unnecessary. Plus this would have the added benefit of not having to think to hard to understand this case if you already understand the pdarray case
is_nan = np_isnan(self.values)
perm = np.array(perm)
if na_position == "last":
perm = np.concatenate([perm[~is_nan], perm[is_nan]])
elif na_position == "first":
perm = np.concatenate([perm[is_nan], perm[~is_nan]])
perm = perm.tolist()
you could also use a ternary for the middle bit you're feeling extra since you've already ensured that na_position
is either "first"
or "last"
also super not necessary, but I think a neat little trick is to use the python builtin sorted
here since we can use the key
to sort on is_nan
where False / 0
comes before True/1
, so non-nan's come first. we'd want to reverse
if na_position == "first"
perm = [p for _, p in sorted(zip(np.isnan(self.values), perm), key=lambda pair: pair[0], reverse=(na_position == "first"))]
This could also be broken out to be more than one line for readability if you prefer. But once again this is all nitpicking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is a lot but the result is actually important lol
While sanity checking these are all equivalent, I actually discovered that we need to set is_nan = np_isnan(self.values)[perm]
in order to get the correct behavior. I think this also applies to the pdarray case
In [2]: vals = [4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2]
In [3]: perm = np.argsort(vals).tolist()
In [4]: np_vals = np.array(vals)
In [5]: def argsort_methods(is_nan):
...: # Original method
...: nan_vals = [i for (i, b) in zip(perm, is_nan) if b]
...: not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b]
...: orig_last = [*not_nan_vals, *nan_vals]
...: orig_first = [*nan_vals, *not_nan_vals]
...: print("ORIG METHOD")
...: print("first: ", orig_first)
...: print("last: ", orig_last)
...:
...: # similar to pdarray case
...: np_perm = np.array(perm)
...: sim_last = np.concatenate([np_perm[~is_nan], np_perm[is_nan]])
...: sim_first = np.concatenate([np_perm[is_nan], np_perm[~is_nan]])
...: print("SIMILAR METHOD")
...: print("first: ", sim_first.tolist())
...: print("last: ", sim_last.tolist())
...:
...: # using sorted builtin
...: sorted_first = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0], reverse=True)]
...: sorted_last = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0])]
...: print("SORTED METHOD")
...: print("first: ", sorted_first)
...: print("last: ", sorted_last)
...:
...: return orig_first, orig_last
...:
In [6]: first, last = argsort_methods(np.isnan(vals))
ORIG METHOD
first: [2, 6, 1, 5, 4, 8, 0, 3, 7]
last: [4, 8, 0, 3, 7, 2, 6, 1, 5]
SIMILAR METHOD
first: [2, 6, 1, 5, 4, 8, 0, 3, 7]
last: [4, 8, 0, 3, 7, 2, 6, 1, 5]
SORTED METHOD
first: [2, 6, 1, 5, 4, 8, 0, 3, 7]
last: [4, 8, 0, 3, 7, 2, 6, 1, 5]
In [7]: np_vals[first]
Out[7]: array([ 1., 3., nan, nan, 0., 2., 4., nan, nan])
In [8]: np_vals[last]
Out[8]: array([ 0., 2., 4., nan, nan, 1., 3., nan, nan])
In [9]: first, last = argsort_methods(np.isnan(vals)[perm])
ORIG METHOD
first: [1, 3, 5, 7, 4, 2, 8, 6, 0]
last: [4, 2, 8, 6, 0, 1, 3, 5, 7]
SIMILAR METHOD
first: [1, 3, 5, 7, 4, 2, 8, 6, 0]
last: [4, 2, 8, 6, 0, 1, 3, 5, 7]
SORTED METHOD
first: [1, 3, 5, 7, 4, 2, 8, 6, 0]
last: [4, 2, 8, 6, 0, 1, 3, 5, 7]
In [10]: np_vals[first]
Out[10]: array([nan, nan, nan, nan, 0., 1., 2., 3., 4.])
In [11]: np_vals[last]
Out[11]: array([ 0., 1., 2., 3., 4., nan, nan, nan, nan])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the changes and added an extra unit test.
arkouda/index.py
Outdated
|
||
from arkouda import isnan as ak_isnan | ||
|
||
is_nan = ak_isnan(self.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be ak_isnan(self.values)[perm]
. The list case has more info on how I came about this but running the following code using this PR gives incorrect results:
>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2])
>>> ak.Index(vals).sort_values()
Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the changes and added an extra unit test.
arkouda/series.py
Outdated
# For non-numeric values, need the descending arange because reverse slicing | ||
# is not supported | ||
idx = argsort(self.values)[arange(self.values.size - 1, -1, -1)] | ||
perm = argsort(self.values)[arange(self.values.size - 1, -1, -1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, i'm not sure when / if this comment used to be true. But it's not the case anymore
>>> vals = ak.array(['c', 'a', 'b', 'c', 'c', 'b'])
>>> ak.argsort(vals)[::-1]
array([4 3 0 5 2 1])
>>> vals[ak.argsort(vals)[::-1]]
array(['c', 'c', 'c', 'b', 'b', 'a'])
using git blame it looks like it was present when series was originally ported over from user provided code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, OK, I will simplify this. Thanks for noticing and pointing this out.
arkouda/sorting.py
Outdated
else: | ||
from arkouda import arange | ||
|
||
return sorted_array[arange(sorted_array.size - 1, -1, -1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I've commented on similar code elsewhere but want to make sure this spot doesn't fall through the cracks
arkouda/sorting.py
Outdated
else: | ||
from arkouda import arange | ||
|
||
return sorted_array[arange(sorted_array.size - 1, -1, -1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
ffc5d4b
to
cad665d
Compare
dc2b9be
to
85d8caf
Compare
Closes #3177 Index.sort_values