
feat: add reduce kernels #3136

Merged
merged 32 commits into main from ManasviGoyal/add-reducer-kernels
Jun 25, 2024

Conversation

@ManasviGoyal (Collaborator) commented May 30, 2024

Kernels tested for different block sizes:

  • awkward_reduce_argmax
  • awkward_reduce_argmin
  • awkward_reduce_count_64
  • awkward_reduce_countnonzero
  • awkward_reduce_max
  • awkward_reduce_min
  • awkward_reduce_prod_bool
  • awkward_reduce_sum
  • awkward_reduce_sum_bool
  • awkward_reduce_sum_int32_bool_64
  • awkward_reduce_sum_int64_bool_64
  • awkward_reduce_prod
  • awkward_ListOffsetArray_reduce_local_outoffsets_64

@lgray (Contributor) commented Jun 6, 2024

@ManasviGoyal in trying to implement query 4 from the analysis benchmarks I found some nasty memory scaling:

If you scroll to the bottom of the trace below, you'll see what happens when attempting to execute the following cell:
[screenshot: the notebook cell shown in the traceback, ending in has2jets = ak.sum(Jet_pt > 40, axis=1) >= 2]

This processes ~53M rows of the input file all at once; the data fits in the GPU with no problem, as does the histogram being filled.

For this test I have merged #3123, #3142, and this PR on top of awkward main. This PR was merged in last.

However, the ak.sum calculation fails: it attempts to allocate 71 terabytes of RAM on the device. This seems excessive and points to poor memory scaling in the implementation. You'll see that it fails in the ak.sum step and nowhere else.

Here's the full stack trace:

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[3], line 16
     11 MET_pt = ak.to_backend(jetmet.MET_pt, "cuda")
     12 q4_hist = hist.Hist(
     13     "Counts",
     14     hist.Bin("met", "$E_{T}^{miss}$ [GeV]", 100, 0, 200),
     15 )
---> 16 has2jets = ak.sum(Jet_pt > 40, axis=1) >= 2
     17 q4_hist.fill(met=MET_pt[has2jets])
     19 q4_hist.to_hist().plot1d(flow="none");

File [~/coffea-gpu/awkward/src/awkward/_dispatch.py:64](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_dispatch.py#line=63), in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File [~/coffea-gpu/awkward/src/awkward/operations/ak_sum.py:210](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_sum.py#line=209), in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    207 yield (array,)
    209 # Implementation
--> 210 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)

File [~/coffea-gpu/awkward/src/awkward/operations/ak_sum.py:277](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_sum.py#line=276), in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    274     layout = ctx.unwrap(array, allow_record=False, primitive_policy="error")
    275 reducer = ak._reducers.Sum()
--> 277 out = ak._do.reduce(
    278     layout,
    279     reducer,
    280     axis=axis,
    281     mask=mask_identity,
    282     keepdims=keepdims,
    283     behavior=ctx.behavior,
    284 )
    285 return ctx.wrap(out, highlevel=highlevel, allow_other=True)

File [~/coffea-gpu/awkward/src/awkward/_do.py:333](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_do.py#line=332), in reduce(layout, reducer, axis, mask, keepdims, behavior)
    331 parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
    332 shifts = None
--> 333 next = layout._reduce_next(
    334     reducer,
    335     negaxis,
    336     starts,
    337     shifts,
    338     parents,
    339     1,
    340     mask,
    341     keepdims,
    342     behavior,
    343 )
    345 return next[0]

File [~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:1612](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py#line=1611), in ListOffsetArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
   1609 trimmed = self._content[self.offsets[0] : self.offsets[-1]]
   1610 nextstarts = self.offsets[:-1]
-> 1612 outcontent = trimmed._reduce_next(
   1613     reducer,
   1614     negaxis,
   1615     nextstarts,
   1616     shifts,
   1617     nextparents,
   1618     globalstarts_length,
   1619     mask,
   1620     keepdims,
   1621     behavior,
   1622 )
   1624 outoffsets = Index64.empty(outlength + 1, index_nplike)
   1625 assert outoffsets.nplike is index_nplike and parents.nplike is index_nplike

File [~/coffea-gpu/awkward/src/awkward/contents/numpyarray.py:1122](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/numpyarray.py#line=1121), in NumpyArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
   1119 assert self.is_contiguous
   1120 assert self._data.ndim == 1
-> 1122 out = reducer.apply(self, parents, starts, shifts, outlength)
   1124 if mask:
   1125     outmask = ak.index.Index8.empty(outlength, self._backend.index_nplike)

File [~/coffea-gpu/awkward/src/awkward/_reducers.py:358](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_reducers.py#line=357), in Sum.apply(self, array, parents, starts, shifts, outlength)
    355 if result.dtype in (np.int64, np.uint64):
    356     assert parents.nplike is array.backend.index_nplike
    357     array.backend.maybe_kernel_error(
--> 358         array.backend[
    359             "awkward_reduce_sum_int64_bool_64",
    360             np.int64,
    361             array.dtype.type,
    362             parents.dtype.type,
    363         ](
    364             result,
    365             array.data,
    366             parents.data,
    367             parents.length,
    368             outlength,
    369         )
    370     )
    371 elif result.dtype in (np.int32, np.uint32):
    372     assert parents.nplike is array.backend.index_nplike

File [~/coffea-gpu/awkward/src/awkward/_kernels.py:169](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_kernels.py#line=168), in CupyKernel.__call__(self, *args)
    157 args = (
    158     *args,
    159     len(ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][1]),
    160     ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][0],
    161 )
    162 ak_cuda.cuda_streamptr_to_contexts[cupy_stream_ptr][1].append(
    163     ak_cuda.Invocation(
    164         name=self.key[0],
    165         error_context=ak._errors.ErrorContext.primary(),
    166     )
    167 )
--> 169 self._impl(grid, blocks, args)

File [~/coffea-gpu/awkward/src/awkward/_connect/cuda/_kernel_signatures.py:4337](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_connect/cuda/_kernel_signatures.py#line=4336), in by_signature.<locals>.f(grid, block, args)
   4335     segment = 0
   4336     grid_size = 1
-> 4337 partial = cupy.zeros(outlength * grid_size, dtype=toptr.dtype)
   4338 temp = cupy.zeros(lenparents, dtype=toptr.dtype)
   4339 cuda_kernel_templates.get_function(fetch_specialization(["awkward_reduce_sum_int64_bool_64_a", int64, bool_, parents.dtype]))((grid_size,), block, (toptr, fromptr, parents, lenparents, outlength, partial, temp, invocation_index, err_code))

File ~/.conda/envs/coffea-gpu/lib/python3.12/site-packages/cupy/_creation/basic.py:248, in zeros(shape, dtype, order)
    229 def zeros(
    230         shape: _ShapeLike,
    231         dtype: DTypeLike = float,
    232         order: _OrderCF = 'C',
    233 ) -> NDArray[Any]:
    234     """Returns a new array of given shape and dtype, filled with zeros.
    235 
    236     Args:
   (...)
    246 
    247     """
--> 248     a = cupy.ndarray(shape, dtype, order=order)
    249     a.data.memset_async(0, a.nbytes)
    250     return a

File cupy[/_core/core.pyx:132](https://analytics-hub.fnal.gov/_core/core.pyx#line=131), in cupy._core.core.ndarray.__new__()

File cupy[/_core/core.pyx:220](https://analytics-hub.fnal.gov/_core/core.pyx#line=219), in cupy._core.core._ndarray_base._init()

File cupy[/cuda/memory.pyx:738](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=737), in cupy.cuda.memory.alloc()

File cupy[/cuda/memory.pyx:1424](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1423), in cupy.cuda.memory.MemoryPool.malloc()

File cupy[/cuda/memory.pyx:1445](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1444), in cupy.cuda.memory.MemoryPool.malloc()

File cupy[/cuda/memory.pyx:1116](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1115), in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

File cupy[/cuda/memory.pyx:1137](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1136), in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

File cupy[/cuda/memory.pyx:1382](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1381), in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc()

File cupy[/cuda/memory.pyx:1385](https://analytics-hub.fnal.gov/cuda/memory.pyx#line=1384), in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc()

OutOfMemoryError: Out of memory allocating 71,381,459,340,288 bytes (allocated so far: 3,718,910,464 bytes).

This error occurred while calling

    ak.sum(
        <Array [[True, False], [...], ..., [False]] type='53446198 * var * ...'>
        axis = 1
    )
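For scale, a quick back-of-the-envelope check (assuming the int64 partial buffer allocated at _kernel_signatures.py:4337, and taking outlength to be the 53,446,198 rows in the error message) shows the failed request is consistent with a scratch buffer scaling as outlength * grid_size:

    # Rough arithmetic; assumes int64 partials and outlength = number of rows.
    outlength = 53_446_198          # rows in the failing ak.sum, from the error
    itemsize = 8                    # bytes per int64 element
    requested = 71_381_459_340_288  # bytes, from the OutOfMemoryError

    grid_size = requested / (outlength * itemsize)
    print(round(grid_size))  # ~166,947: one outlength-sized partial per block

In other words, the allocation grows as the number of blocks times the number of output rows, which explodes for large inputs even though the data itself fits comfortably on the device.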

@ManasviGoyal (Collaborator, Author) commented Jun 6, 2024

> @ManasviGoyal in trying to implement query 4 from the analysis benchmarks I found some nasty memory scaling: […]

Hi, I am still working on these kernels and need to fix a few things. I will update once I am done with this PR. The issue is most likely due to the use of partial; I plan to remove it. PR #3123 is just for experimenting in separate Python scripts; it is not the implementation.
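For context, a minimal sketch of what an atomic-accumulator segmented sum can look like, using the (toptr, fromptr, parents, lenparents) layout visible in the kernel signature in the trace above. This is an illustration under those assumptions, not the PR's actual kernel: each element atomically adds into its parent's output slot, so no outlength * grid_size scratch buffer is needed.

    import cupy as cp

    # Sketch only: segmented sum of booleans into int64 via one atomicAdd
    # per element; writes go to toptr[parents[i]], so no partial buffer.
    atomic_sum = cp.RawKernel(r'''
    extern "C" __global__
    void sum_bool_int64(long long* toptr, const bool* fromptr,
                        const long long* parents, long long lenparents) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i < lenparents && fromptr[i]) {
            atomicAdd((unsigned long long*)&toptr[parents[i]], 1ULL);
        }
    }
    ''', 'sum_bool_int64')

    fromptr = cp.asarray([True, False, True], dtype=cp.bool_)
    parents = cp.asarray([0, 0, 2], dtype=cp.int64)  # list index of each element
    toptr = cp.zeros(3, dtype=cp.int64)
    block = 128
    grid = (len(fromptr) + block - 1) // block
    atomic_sum((grid,), (block,), (toptr, fromptr, parents, cp.int64(len(fromptr))))
    print(toptr)  # [1 0 1]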

@lgray (Contributor) commented Jun 6, 2024

No worries - just reporting what I'm finding with things as they are. Thanks!

@ManasviGoyal (Collaborator, Author) commented:
> No worries - just reporting what I'm finding with things as they are. Thanks!

Yes, it's very helpful, since I can only test a limited number of cases; knowing how it works on actual data helps in identifying issues. Thanks! I will keep you updated.

@lgray (Contributor) commented Jun 7, 2024

The change to an accumulator with atomics certainly fixed the memory issue, though it's a bit slower than I expected for a sum: ~250 MHz throughput for summing bools into int64.

As an optimization for sums over the last dimension, couldn't you write this without atomics or any race conditions by having each thread sum over the last dimension into an array of one fewer dimension? Or is the thread divergence too bad, so that atomics are still faster?

With the atomic implementation you're guaranteed to have access contention, because every element is hitting the same output position to make the sum. I don't have good intuition about whether that's better or worse than thread divergence.

@jpivarski maybe?
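For what it's worth, a minimal sketch of the no-atomics alternative described above (hypothetical, offsets-based rather than parents-based): one thread owns one output row and serially sums its sublist, so every write lands in a distinct slot with no contention. The cost is divergence when neighboring rows have very different lengths.

    import cupy as cp

    # Sketch only: one thread per output row, no atomics or race conditions.
    # Thread i sums fromptr[offsets[i]:offsets[i+1]] into toptr[i].
    row_sum = cp.RawKernel(r'''
    extern "C" __global__
    void row_sum_bool_int64(long long* toptr, const bool* fromptr,
                            const long long* offsets, long long outlength) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i < outlength) {
            long long acc = 0;
            for (long long j = offsets[i]; j < offsets[i + 1]; j++) {
                acc += fromptr[j] ? 1 : 0;
            }
            toptr[i] = acc;  // distinct slot per thread: no contention
        }
    }
    ''', 'row_sum_bool_int64')

    offsets = cp.asarray([0, 2, 2, 3], dtype=cp.int64)  # three variable-length lists
    values = cp.asarray([True, False, True], dtype=cp.bool_)
    out = cp.zeros(len(offsets) - 1, dtype=cp.int64)
    block = 128
    grid = (len(out) + block - 1) // block
    row_sum((grid,), (block,), (out, values, offsets, cp.int64(len(out))))
    print(out)  # [1 0 1]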

@lgray (Contributor) commented Jun 7, 2024

In any case - with this latest change I've now got query 4 done on the ADL benchmarks. The rest seem to require combinations, so I'll wait for that!

@ManasviGoyal force-pushed the ManasviGoyal/add-reducer-kernels branch from a98503d to 38d314d on June 18, 2024 13:23
@ianna (Collaborator) left a comment

I do not see any changes in this PR that would cause the test failures, or that would cause NumPy to use int32 for sum operations on boolean arrays (that depends on the platform: 32-bit vs 64-bit default integer).

We could explicitly cast the result of the NumPy sum to int64 to ensure the test doesn't fail:

numpy_sum = np.sum(array, axis=-1).astype(np.int64)

I'd say merge it - @jpivarski?
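A small demonstration of the platform dependence being described (NumPy sums booleans into its default integer type, which is 32-bit on some platforms, e.g. Windows):

    import numpy as np

    array = np.array([[True, False], [True, True]])
    numpy_sum = np.sum(array, axis=-1)
    # int64 on 64-bit Linux/macOS; int32 where the default integer is 32 bits
    print(numpy_sum.dtype)

    # The explicit cast pins the dtype regardless of platform:
    print(np.sum(array, axis=-1).astype(np.int64).dtype)  # always int64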

@ianna (Collaborator) commented Jun 19, 2024

I'm checking it with #3158

@ianna (Collaborator) commented Jun 19, 2024

> I'm checking it with #3158 and #3159

I've opened an issue: #3160

@jpivarski (Member) left a comment

This is excellent!!! As I understand it, this enables all axis=-1 reducers, with tests for crossing block boundaries. As we talked about in our meeting, it could have more tests of block boundary crossing and integration tests (converted from tests to tests-cuda, particularly test_0115_generic_reducer_operation.py).

The one failing test is unrelated, and it's failing in main also, so it isn't a blocker to merging this PR. (We should never introduce failing tests into main, but it's already there. @ianna, if you need elevated permissions to bypass the red warning about merging with a failing test, I can give you those permissions.)

What is a blocker, however, is that these need to be tested on more than one GPU. @ianna will be able to test it on Sunday, and it can be merged if it passes on her GPU. I'll be able to test it on Tuesday, and I'll just test it in main (after merging).

@ianna (Collaborator) left a comment

@ManasviGoyal - all tests pass except for tests-cuda-kernels/test_cudaawkward_BitMaskedArray_to_ByteMaskedArray.py, which relies on NumPy. I've added an import. Please double-check. Thanks!

Review thread on dev/generate-tests.py (resolved)
@ianna (Collaborator) left a comment

@ManasviGoyal - all tests pass on my local computer with the updated branch including the NumPy import! If it works fine on yours, the PR is good to be merged. Thanks.

@ManasviGoyal (Collaborator, Author) commented:
> @ManasviGoyal - all tests pass on my local computer with the updated branch including the NumPy import! If it works fine on yours, the PR is good to be merged. Thanks.

@ianna All macOS tests are cancelled in the CI, so I am unable to merge.

@ianna (Collaborator) commented Jun 24, 2024

@jpivarski - I think we need some other macOS node:

Run Tests (macos-11, 3.8, x64, full): This is a scheduled macOS-11 brownout. The macOS-11 environment is deprecated and will be removed on June 28th, 2024.

@ManasviGoyal (Collaborator, Author) commented:
> This is excellent!!! As I understand it, this enables all axis=-1 reducers, with tests for crossing block boundaries. As we talked about in our meeting, it could have more tests of block boundary crossing and integration tests (converted from tests to tests-cuda, particularly test_0115_generic_reducer_operation.py).

@jpivarski #3162 adds all the axis=-1 tests in test_0115_generic_reducer_operation.py for CUDA. I have also added tests that check block-boundary crossing for array size 3000 (primes for ak.prod), as we discussed in the last meeting. Thanks!
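As an illustration of such a block-boundary check (a sketch, not the actual tests in #3162): a flattened length of 3000 exceeds a typical 1024-thread CUDA block, so the reducer has to combine results across blocks, and the CPU backend supplies the reference answer.

    import awkward as ak
    import numpy as np

    # 100 lists of 30 elements: 3000 values in total, spanning several blocks.
    data = ak.Array(np.arange(3000, dtype=np.int64).reshape(100, 30).tolist())
    cuda_data = ak.to_backend(data, "cuda")

    cpu_result = ak.sum(data, axis=-1)
    gpu_result = ak.to_backend(ak.sum(cuda_data, axis=-1), "cpu")
    assert ak.to_list(cpu_result) == ak.to_list(gpu_result)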

@jpivarski (Member) left a comment

@ianna and I have both tested with GPUs and all is good, so I'll merge this now. (Merging before June 28 avoids the complications of moving to macOS > 11.)

@jpivarski merged commit ba4890a into main on Jun 25, 2024
39 checks passed
@jpivarski deleted the ManasviGoyal/add-reducer-kernels branch on June 25, 2024 15:30