Closes #3177 Index.sort_values #3235

ajpotts · 2024-05-23T20:32:47Z

Closes #3177 Index.sort_values

stress-tess

I haven't looked at the tests terribly closely yet. I left a few comments for you to check out but the most important is I stumbled upon a bug due to using ak_isnan(self.values) instead of ak_isnan(self.values)[perm]

>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2])
>>> ak.Index(vals).sort_values()
Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')

stress-tess · 2024-06-24T21:51:11Z

arkouda/categorical.py

+        if ascending is True:
+            return sorted_array
+        else:
+            return sorted_array[arange(sorted_array.size - 1, -1, -1)]


I'm trying to think how sorted_array[arange(sorted_array.size - 1, -1, -1)] differs from something like sorted_array[::-1]. I can't think of any benefit to creating the array and the slice avoids having to allocate the memory for the indexing array

I would normally use the slice here, but even that seems not ideal since pdarray indexing is kinda aimed for the general random access into the array case. But in the reverse an array case there's so much structure, you'd think we could do better. One thing I found when I searched our codebase was flip added by @jeremiah-corrado for array api stuff. The implementation is for arbitrary dimensions and the details are a bit hard for me to parse through, so it's not immediately if this would have better performance but it's something that def worth looking into since reversing an array pops up a lot

python code:

arkouda/arkouda/array_api/manipulation_functions.py

Lines 122 to 132 in 74b5889

def flip(x: Array, /, *, axis: Optional[Union[int, Tuple[int, ...]]] = None) -> Array:

"""

Reverse an array's values along a particular axis or axes.

Parameters

----------

x : Array

The array to flip

axis : int or Tuple[int, ...], optional

The axis or axes along which to flip the array. If None, flip the array along all axes.

"""

chpl code:

arkouda/src/ManipulationMsg.chpl

Line 373 in 74b5889

proc flipMsg(cmd: string, msgArgs: borrowed MessageArgs, st: borrowed SymTab, param nd: int): MsgTuple throws {

FWIW, that implementation of flip is not very efficient because it has to generalize to any number of dimensions. I'd expect it would give you about the same performance as sorted_array[::-1].

A good strategy might be to use flipMsg for now, where I'll probably add a more efficient 1D path at some point in the future. And then we won't have to do anything for sort_values to take advantage of those performance improvements when that happens. (This will require uncommenting this line to always build ManipulationMsg — which is a switch we were going to have to flip at some point to support other numpy-like functionality in the main arkouda api).

makes sense! I agree,I think enabling ManipulationMsg in the server modules config file and using flip here is the way to go

stress-tess · 2024-06-24T22:02:33Z

arkouda/index.py

        if isinstance(lst, list):
-            d = dtype(type(lst[0]))
-            for item in lst:
-                assert dtype(type(item)) == d, (
-                    f"Values of Index must all be same type.  "
-                    f"Types {d} and {dtype(type(item))} do not match."
-                )
-            return d
+            return np.array(lst).dtype


Just pointing out that this used to error on a list of mixed types. I think this a great simplification as long as that's the behavior we're going for

In [4]: def _dtype_of_lst(lst): ...: d = ak.dtype(type(lst[0])) ...: for item in lst: ...: assert ak.dtype(type(item)) == d, ( ...: f"Values of Index must all be same type. " ...: f"Types {d} and {ak.dtype(type(item))} do not match." ...: ) ...: return d ...: In [5]: lst = [1, 1.0, 'hi'] In [6]: _dtype_of_lst(lst) --------------------------------------------------------------------------- AssertionError Traceback (most recent call last) AssertionError: Values of Index must all be same type. Types int64 and float64 do not match. In [7]: np.array(lst).dtype Out[7]: dtype('<U32')

Good point. The only place this is used is to set the dtype of an index that has a list for values. Since a mixed type list can converted to a pdarray (aparently), I can't think of why this would be a problem?

In [7]: l = ["abc", 1, 1.3] In [8]: np.array(l).dtype Out[8]: dtype('<U32') In [9]: ak.array(l) Out[9]: array(['abc', '1', '1.3'])

stress-tess · 2024-06-24T22:09:01Z

arkouda/index.py

+        Return a sorted copy of the index.
+
+        Return a sorted copy of the index, and optionally return the indices


we say Return a sorted copy of the index twice. Idk if that's on purpose or not

I combined these to simplify. Thanks for pointing this out.

stress-tess · 2024-06-24T22:38:16Z

arkouda/index.py

+            is_nan = np_isnan(self.values)
+            nan_vals = [i for (i, b) in zip(perm, is_nan) if b]
+            not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b]
+
+            if na_position == "last":
+                perm = [*not_nan_vals, *nan_vals]  # type: ignore
+            elif na_position == "first":
+                perm = [*nan_vals, *not_nan_vals]  # type: ignore


wouldn't similar logic for the pdarray case work here? I know these values are relatively small since it's a list so perf doesn't really matter but doing a second list comprehension just to get the negation of the first seems kinda unnecessary. Plus this would have the added benefit of not having to think to hard to understand this case if you already understand the pdarray case

is_nan = np_isnan(self.values) perm = np.array(perm) if na_position == "last": perm = np.concatenate([perm[~is_nan], perm[is_nan]]) elif na_position == "first": perm = np.concatenate([perm[is_nan], perm[~is_nan]]) perm = perm.tolist()

you could also use a ternary for the middle bit you're feeling extra since you've already ensured that na_position is either "first" or "last"

also super not necessary, but I think a neat little trick is to use the python builtin sorted here since we can use the key to sort on is_nan where False / 0 comes before True/1, so non-nan's come first. we'd want to reverse if na_position == "first"

perm = [p for _, p in sorted(zip(np.isnan(self.values), perm), key=lambda pair: pair[0], reverse=(na_position == "first"))]

This could also be broken out to be more than one line for readability if you prefer. But once again this is all nitpicking

Sorry this is a lot but the result is actually important lol

While sanity checking these are all equivalent, I actually discovered that we need to set is_nan = np_isnan(self.values)[perm] in order to get the correct behavior. I think this also applies to the pdarray case

In [2]: vals = [4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2] In [3]: perm = np.argsort(vals).tolist() In [4]: np_vals = np.array(vals) In [5]: def argsort_methods(is_nan): ...: # Original method ...: nan_vals = [i for (i, b) in zip(perm, is_nan) if b] ...: not_nan_vals = [i for (i, b) in zip(perm, is_nan) if not b] ...: orig_last = [*not_nan_vals, *nan_vals] ...: orig_first = [*nan_vals, *not_nan_vals] ...: print("ORIG METHOD") ...: print("first: ", orig_first) ...: print("last: ", orig_last) ...: ...: # similar to pdarray case ...: np_perm = np.array(perm) ...: sim_last = np.concatenate([np_perm[~is_nan], np_perm[is_nan]]) ...: sim_first = np.concatenate([np_perm[is_nan], np_perm[~is_nan]]) ...: print("SIMILAR METHOD") ...: print("first: ", sim_first.tolist()) ...: print("last: ", sim_last.tolist()) ...: ...: # using sorted builtin ...: sorted_first = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0], reverse=True)] ...: sorted_last = [p for _, p in sorted(zip(is_nan, perm), key=lambda pair: pair[0])] ...: print("SORTED METHOD") ...: print("first: ", sorted_first) ...: print("last: ", sorted_last) ...: ...: return orig_first, orig_last ...: In [6]: first, last = argsort_methods(np.isnan(vals)) ORIG METHOD first: [2, 6, 1, 5, 4, 8, 0, 3, 7] last: [4, 8, 0, 3, 7, 2, 6, 1, 5] SIMILAR METHOD first: [2, 6, 1, 5, 4, 8, 0, 3, 7] last: [4, 8, 0, 3, 7, 2, 6, 1, 5] SORTED METHOD first: [2, 6, 1, 5, 4, 8, 0, 3, 7] last: [4, 8, 0, 3, 7, 2, 6, 1, 5] In [7]: np_vals[first] Out[7]: array([ 1., 3., nan, nan, 0., 2., 4., nan, nan]) In [8]: np_vals[last] Out[8]: array([ 0., 2., 4., nan, nan, 1., 3., nan, nan]) In [9]: first, last = argsort_methods(np.isnan(vals)[perm]) ORIG METHOD first: [1, 3, 5, 7, 4, 2, 8, 6, 0] last: [4, 2, 8, 6, 0, 1, 3, 5, 7] SIMILAR METHOD first: [1, 3, 5, 7, 4, 2, 8, 6, 0] last: [4, 2, 8, 6, 0, 1, 3, 5, 7] SORTED METHOD first: [1, 3, 5, 7, 4, 2, 8, 6, 0] last: [4, 2, 8, 6, 0, 1, 3, 5, 7] In [10]: np_vals[first] Out[10]: array([nan, nan, nan, nan, 0., 1., 2., 3., 4.]) In [11]: np_vals[last] Out[11]: array([ 0., 1., 2., 3., 4., nan, nan, nan, nan])

I made the changes and added an extra unit test.

stress-tess · 2024-06-24T23:43:13Z

arkouda/index.py

+
+            from arkouda import isnan as ak_isnan
+
+            is_nan = ak_isnan(self.values)


I think this needs to be ak_isnan(self.values)[perm]. The list case has more info on how I came about this but running the following code using this PR gives incorrect results:

>>> vals = ak.array([4, np.nan, 1, np.nan, 0, np.nan, 3, np.nan, 2]) >>> ak.Index(vals).sort_values() Index(array([0.0 2.0 4.0 nan nan 1.0 3.0 nan nan]), dtype='float64')

I made the changes and added an extra unit test.

stress-tess · 2024-06-25T00:02:55Z

arkouda/series.py

                # For non-numeric values, need the descending arange because reverse slicing
                # is not supported
-                idx = argsort(self.values)[arange(self.values.size - 1, -1, -1)]
+                perm = argsort(self.values)[arange(self.values.size - 1, -1, -1)]


huh, i'm not sure when / if this comment used to be true. But it's not the case anymore

>>> vals = ak.array(['c', 'a', 'b', 'c', 'c', 'b']) >>> ak.argsort(vals)[::-1] array([4 3 0 5 2 1]) >>> vals[ak.argsort(vals)[::-1]] array(['c', 'c', 'c', 'b', 'b', 'a'])

using git blame it looks like it was present when series was originally ported over from user provided code

Hmm, OK, I will simplify this. Thanks for noticing and pointing this out.

stress-tess · 2024-06-25T00:07:41Z

arkouda/sorting.py

+    else:
+        from arkouda import arange
+
+        return sorted_array[arange(sorted_array.size - 1, -1, -1)]


I know I've commented on similar code elsewhere but want to make sure this spot doesn't fall through the cracks

stress-tess · 2024-06-25T00:08:15Z

arkouda/sorting.py

+    else:
+        from arkouda import arange
+
+        return sorted_array[arange(sorted_array.size - 1, -1, -1)]


same as above

ajpotts force-pushed the 3177-Index.sort_values branch 2 times, most recently from 21af366 to 6f8dab4 Compare May 23, 2024 20:35

ajpotts changed the title ~~3177 index.sort values~~ Closes #3177 index.sort values May 23, 2024

ajpotts changed the title ~~Closes #3177 index.sort values~~ Closes #3177 Index.sort_values May 23, 2024

ajpotts force-pushed the 3177-Index.sort_values branch 7 times, most recently from d09dc21 to 02f814c Compare May 30, 2024 15:48

ajpotts marked this pull request as ready for review May 30, 2024 15:48

ajpotts requested review from stress-tess, jaketrookman and drculhane May 30, 2024 15:48

ajpotts force-pushed the 3177-Index.sort_values branch 4 times, most recently from 9638b14 to ad6be3a Compare May 31, 2024 13:37

ajpotts force-pushed the 3177-Index.sort_values branch from 1004ddf to ffc5d4b Compare June 10, 2024 20:22

stress-tess requested changes Jun 25, 2024

View reviewed changes

ajpotts force-pushed the 3177-Index.sort_values branch from ffc5d4b to cad665d Compare June 26, 2024 21:18

ajpotts added 3 commits October 10, 2024 20:11

Closes Bears-R-Us#3177 Index.sort_values

f1c7a56

fix in response to comment about reverse indexing

84cdc09

Fix bug found in code review.

85d8caf

ajpotts force-pushed the 3177-Index.sort_values branch from dc2b9be to 85d8caf Compare October 11, 2024 00:12

fix flake8 and mypy

01c174d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closes #3177 Index.sort_values #3235

Closes #3177 Index.sort_values #3235

ajpotts commented May 23, 2024 •

edited

Loading

stress-tess left a comment

stress-tess Jun 24, 2024

jeremiah-corrado Jun 25, 2024

stress-tess Jun 26, 2024

stress-tess Jun 24, 2024

ajpotts Jun 28, 2024

stress-tess Jun 24, 2024

ajpotts Jun 26, 2024

stress-tess Jun 24, 2024

stress-tess Jun 24, 2024

ajpotts Jun 28, 2024

stress-tess Jun 24, 2024

ajpotts Jun 28, 2024

stress-tess Jun 25, 2024

ajpotts Jun 26, 2024

stress-tess Jun 25, 2024

stress-tess Jun 25, 2024

	def flip(x: Array, /, *, axis: Optional[Union[int, Tuple[int, ...]]] = None) -> Array:
	"""
	Reverse an array's values along a particular axis or axes.

	Parameters
	----------
	x : Array
	The array to flip
	axis : int or Tuple[int, ...], optional
	The axis or axes along which to flip the array. If None, flip the array along all axes.
	"""

		Return a sorted copy of the index.

		Return a sorted copy of the index, and optionally return the indices


		from arkouda import isnan as ak_isnan

		is_nan = ak_isnan(self.values)

Closes #3177 Index.sort_values #3235

Are you sure you want to change the base?

Closes #3177 Index.sort_values #3235

Conversation

ajpotts commented May 23, 2024 • edited Loading

stress-tess left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajpotts commented May 23, 2024 •

edited

Loading