Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #3177 Index.sort_values #3235

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ajpotts
Copy link
Contributor

@ajpotts ajpotts commented May 23, 2024

Closes #3177 Index.sort_values

@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch 2 times, most recently from 21af366 to 6f8dab4 Compare May 23, 2024 20:35
@ajpotts ajpotts changed the title 3177 index.sort values Closes #3177 index.sort values May 23, 2024
@ajpotts ajpotts changed the title Closes #3177 index.sort values Closes #3177 Index.sort_values May 23, 2024
@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch 7 times, most recently from d09dc21 to 02f814c Compare May 30, 2024 15:48
@ajpotts ajpotts marked this pull request as ready for review May 30, 2024 15:48
@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch 4 times, most recently from 9638b14 to ad6be3a Compare May 31, 2024 13:37
@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch from 1004ddf to ffc5d4b Compare June 10, 2024 20:22
Copy link
Member

@stress-tess stress-tess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the tests terribly closely yet. I left a few comments for you to check out but the most important is I stumbled upon a bug due to using ak_isnan(self.values) instead of ak_isnan(self.values)[perm]

>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2])
>>> ak.Index(vals).sort_values()
Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')

Comment on lines 866 to 869
if ascending is True:
return sorted_array
else:
return sorted_array[arange(sorted_array.size - 1, -1, -1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to think how sorted_array[arange(sorted_array.size - 1, -1, -1)] differs from something like sorted_array[::-1]. I can't think of any benefit to creating the array and the slice avoids having to allocate the memory for the indexing array

I would normally use the slice here, but even that seems not ideal since pdarray indexing is kinda aimed for the general random access into the array case. But in the reverse an array case there's so much structure, you'd think we could do better. One thing I found when I searched our codebase was flip added by @jeremiah-corrado for array api stuff. The implementation is for arbitrary dimensions and the details are a bit hard for me to parse through, so it's not immediately if this would have better performance but it's something that def worth looking into since reversing an array pops up a lot

python code:

def flip(x: Array, /, *, axis: Optional[Union[int, Tuple[int, ...]]] = None) -> Array:
"""
Reverse an array's values along a particular axis or axes.
Parameters
----------
x : Array
The array to flip
axis : int or Tuple[int, ...], optional
The axis or axes along which to flip the array. If None, flip the array along all axes.
"""

chpl code:

proc flipMsg(cmd: string, msgArgs: borrowed MessageArgs, st: borrowed SymTab, param nd: int): MsgTuple throws {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, that implementation of flip is not very efficient because it has to generalize to any number of dimensions. I'd expect it would give you about the same performance as sorted_array[::-1].

A good strategy might be to use flipMsg for now, where I'll probably add a more efficient 1D path at some point in the future. And then we won't have to do anything for sort_values to take advantage of those performance improvements when that happens. (This will require uncommenting this line to always build ManipulationMsg — which is a switch we were going to have to flip at some point to support other numpy-like functionality in the main arkouda api).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense! I agree,I think enabling ManipulationMsg in the server modules config file and using flip here is the way to go

Comment on lines 176 to +185
if isinstance(lst, list):
d = dtype(type(lst[0]))
for item in lst:
assert dtype(type(item)) == d, (
f"Values of Index must all be same type. "
f"Types {d} and {dtype(type(item))} do not match."
)
return d
return np.array(lst).dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pointing out that this used to error on a list of mixed types. I think this a great simplification as long as that's the behavior we're going for

In [4]: def _dtype_of_lst(lst):
   ...:     d = ak.dtype(type(lst[0]))
   ...:     for item in lst:
   ...:         assert ak.dtype(type(item)) == d, (
   ...:             f"Values of Index must all be same type.  "
   ...:             f"Types {d} and {ak.dtype(type(item))} do not match."
   ...:         )
   ...:     return d
   ...:

In [5]: lst = [1, 1.0, 'hi']

In [6]: _dtype_of_lst(lst)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

AssertionError: Values of Index must all be same type.  Types int64 and float64 do not match.

In [7]: np.array(lst).dtype
Out[7]: dtype('<U32')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. The only place this is used is to set the dtype of an index that has a list for values. Since a mixed type list can converted to a pdarray (aparently), I can't think of why this would be a problem?

In [7]: l = ["abc", 1, 1.3]

In [8]: np.array(l).dtype
Out[8]: dtype('<U32')

In [9]: ak.array(l)
Out[9]: array(['abc', '1', '1.3'])

arkouda/index.py Outdated
Comment on lines 357 to 359
Return a sorted copy of the index.

Return a sorted copy of the index, and optionally return the indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we say Return a sorted copy of the index twice. Idk if that's on purpose or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I combined these to simplify. Thanks for pointing this out.

arkouda/index.py Outdated
Comment on lines 429 to 436
is_nan = np_isnan(self.values)
nan_vals = [i for (i, b) in zip(perm, is_nan) if b]
not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b]

if na_position == "last":
perm = [*not_nan_vals, *nan_vals] # type: ignore
elif na_position == "first":
perm = [*nan_vals, *not_nan_vals] # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't similar logic for the pdarray case work here? I know these values are relatively small since it's a list so perf doesn't really matter but doing a second list comprehension just to get the negation of the first seems kinda unnecessary. Plus this would have the added benefit of not having to think to hard to understand this case if you already understand the pdarray case

is_nan = np_isnan(self.values)
perm = np.array(perm)

if na_position == "last":
    perm = np.concatenate([perm[~is_nan], perm[is_nan]])
elif na_position == "first":
    perm = np.concatenate([perm[is_nan], perm[~is_nan]])

perm = perm.tolist()

you could also use a ternary for the middle bit you're feeling extra since you've already ensured that na_position is either "first" or "last"

also super not necessary, but I think a neat little trick is to use the python builtin sorted here since we can use the key to sort on is_nan where False / 0 comes before True/1, so non-nan's come first. we'd want to reverse if na_position == "first"

perm = [p for _, p in sorted(zip(np.isnan(self.values), perm), key=lambda pair: pair[0], reverse=(na_position == "first"))]

This could also be broken out to be more than one line for readability if you prefer. But once again this is all nitpicking

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is a lot but the result is actually important lol

While sanity checking these are all equivalent, I actually discovered that we need to set is_nan = np_isnan(self.values)[perm] in order to get the correct behavior. I think this also applies to the pdarray case

In [2]: vals = [4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2]

In [3]: perm = np.argsort(vals).tolist()

In [4]: np_vals = np.array(vals)

In [5]: def argsort_methods(is_nan):
   ...:     # Original method
   ...:     nan_vals = [i for (i, b) in zip(perm, is_nan) if b]
   ...:     not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b]
   ...:     orig_last = [*not_nan_vals, *nan_vals]
   ...:     orig_first = [*nan_vals, *not_nan_vals]
   ...:     print("ORIG METHOD")
   ...:     print("first: ", orig_first)
   ...:     print("last: ", orig_last)
   ...:
   ...:     # similar to pdarray case
   ...:     np_perm = np.array(perm)
   ...:     sim_last = np.concatenate([np_perm[~is_nan], np_perm[is_nan]])
   ...:     sim_first = np.concatenate([np_perm[is_nan], np_perm[~is_nan]])
   ...:     print("SIMILAR METHOD")
   ...:     print("first: ", sim_first.tolist())
   ...:     print("last: ", sim_last.tolist())
   ...:
   ...:     # using sorted builtin
   ...:     sorted_first = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0], reverse=True)]
   ...:     sorted_last = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0])]
   ...:     print("SORTED METHOD")
   ...:     print("first: ", sorted_first)
   ...:     print("last: ", sorted_last)
   ...:
   ...:     return orig_first, orig_last
   ...:

In [6]: first, last = argsort_methods(np.isnan(vals))
ORIG METHOD
first:  [2, 6, 1, 5, 4, 8, 0, 3, 7]
last:  [4, 8, 0, 3, 7, 2, 6, 1, 5]
SIMILAR METHOD
first:  [2, 6, 1, 5, 4, 8, 0, 3, 7]
last:  [4, 8, 0, 3, 7, 2, 6, 1, 5]
SORTED METHOD
first:  [2, 6, 1, 5, 4, 8, 0, 3, 7]
last:  [4, 8, 0, 3, 7, 2, 6, 1, 5]

In [7]: np_vals[first]
Out[7]: array([ 1.,  3., nan, nan,  0.,  2.,  4., nan, nan])

In [8]: np_vals[last]
Out[8]: array([ 0.,  2.,  4., nan, nan,  1.,  3., nan, nan])

In [9]: first, last = argsort_methods(np.isnan(vals)[perm])
ORIG METHOD
first:  [1, 3, 5, 7, 4, 2, 8, 6, 0]
last:  [4, 2, 8, 6, 0, 1, 3, 5, 7]
SIMILAR METHOD
first:  [1, 3, 5, 7, 4, 2, 8, 6, 0]
last:  [4, 2, 8, 6, 0, 1, 3, 5, 7]
SORTED METHOD
first:  [1, 3, 5, 7, 4, 2, 8, 6, 0]
last:  [4, 2, 8, 6, 0, 1, 3, 5, 7]

In [10]: np_vals[first]
Out[10]: array([nan, nan, nan, nan,  0.,  1.,  2.,  3.,  4.])

In [11]: np_vals[last]
Out[11]: array([ 0.,  1.,  2.,  3.,  4., nan, nan, nan, nan])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the changes and added an extra unit test.

arkouda/index.py Outdated

from arkouda import isnan as ak_isnan

is_nan = ak_isnan(self.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be ak_isnan(self.values)[perm]. The list case has more info on how I came about this but running the following code using this PR gives incorrect results:

>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2])
>>> ak.Index(vals).sort_values()
Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the changes and added an extra unit test.

Comment on lines 711 to 713
# For non-numeric values, need the descending arange because reverse slicing
# is not supported
idx = argsort(self.values)[arange(self.values.size - 1, -1, -1)]
perm = argsort(self.values)[arange(self.values.size - 1, -1, -1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, i'm not sure when / if this comment used to be true. But it's not the case anymore

>>> vals = ak.array(['c', 'a', 'b', 'c', 'c', 'b'])
>>> ak.argsort(vals)[::-1]
array([4 3 0 5 2 1])

>>> vals[ak.argsort(vals)[::-1]]
array(['c', 'c', 'c', 'b', 'b', 'a'])

using git blame it looks like it was present when series was originally ported over from user provided code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, OK, I will simplify this. Thanks for noticing and pointing this out.

else:
from arkouda import arange

return sorted_array[arange(sorted_array.size - 1, -1, -1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I've commented on similar code elsewhere but want to make sure this spot doesn't fall through the cracks

else:
from arkouda import arange

return sorted_array[arange(sorted_array.size - 1, -1, -1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch from ffc5d4b to cad665d Compare June 26, 2024 21:18
@ajpotts ajpotts force-pushed the 3177-Index.sort_values branch from dc2b9be to 85d8caf Compare October 11, 2024 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index.sort_values
3 participants