Reduce memory usage of as_categorical_column #14138

wence- · 2023-09-20T14:00:22Z

Description

The main culprit was the way the codes returned from _label_encoding were being ordered. We were generating an int64 column for the order, gathering it through the left gather map, and then argsorting the result, before using that ordering as a gather map for the codes.

We note that gather(y, with=argsort(x)) is equivalent to sort_by_key(y, with=x), so we use that instead (avoiding an unnecessary gather). Furthermore, gather([0..n), with=x) is just x itself, so we can avoid that gather too.
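The two identities can be checked on small inputs. The following is a minimal pure-Python sketch, not the cuDF implementation: the helpers `argsort`, `gather`, and `sort_by_key` are hypothetical list-based stand-ins for the corresponding column primitives, and both sorts are assumed to be stable.

```python
import random


def argsort(keys):
    """Indices that stably sort `keys` in ascending order."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def gather(values, gather_map):
    """out[i] = values[gather_map[i]]."""
    return [values[i] for i in gather_map]


def sort_by_key(values, keys):
    """Reorder `values` by ascending `keys` (stable), with no index column."""
    return [v for _, v in sorted(zip(keys, values), key=lambda kv: kv[0])]


random.seed(0)
n = 12
x = [random.randrange(5) for _ in range(n)]    # keys, with repeats
y = [random.randrange(100) for _ in range(n)]  # values to reorder

# Identity 1: argsort followed by gather equals a single sort-by-key.
assert gather(y, argsort(x)) == sort_by_key(y, x)

# Identity 2: gathering the sequence [0, n) through a map returns the map.
gather_map = random.sample(range(n), n)
assert gather(list(range(n)), gather_map) == gather_map
```

In the first identity the left-hand side materialises an index column of length n before reordering, which is the intermediate the change avoids.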

This reduces the peak memory footprint of categorifying a random column of 500_000_000 int32 values where there are 100 unique values from 24.75 GiB to 11.67 GiB.
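To put those figures in context, here is a back-of-the-envelope calculation. It counts only raw data buffers (no null masks or allocator overhead), and the helper `column_gib` is a hypothetical name used for illustration.

```python
GIB = 2**30
N = 500_000_000


def column_gib(n_rows, itemsize):
    """Size in GiB of a dense column of n_rows fixed-width values."""
    return n_rows * itemsize / GIB


int32_col = column_gib(N, 4)  # the input column
int64_col = column_gib(N, 8)  # an int64 ordering column of the same length

before, after = 24.75, 11.67  # peak footprints reported above, in GiB
reduction = 1 - after / before

print(f"int32 column: {int32_col:.2f} GiB")
print(f"int64 column: {int64_col:.2f} GiB")
print(f"peak before:  {before / int32_col:.1f}x the input column")
print(f"peak after:   {after / int32_col:.1f}x the input column")
print(f"reduction:    {reduction:.0%}")
```

Each int64 temporary of length N costs twice the input column, which is why removing such intermediates has a large effect on the peak.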

Test code

import cudf
import cupy as cp

K = 100
N = 500_000_000
rng = cp.random._generator.RandomState()
column = cudf.core.column.as_column(rng.choice(cp.arange(K, dtype="int32"), size=(N,), replace=True))
column = column.astype("category", ordered=False)

Before

After

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

bdice

Great! Note that this is an example of the performance antipattern discussed in #13557.

wence- · 2023-09-20T20:18:27Z

/merge

harrism · 2023-09-20T21:26:26Z

Is performance affected?

wence- · 2023-09-21T09:27:12Z

> Is performance affected?

Yes, but positively. I ran:

import time
import cupy as cp
import cudf
import rmm

rmm.reinitialize(pool_allocator=True)

rng = cp.random._generator.RandomState(seed=108)
for K in [2**4, 2**10, 2**12, 2**14, 2**16]:
    for N in [1_000_000, 10_000_000, 100_000_000, 250_000_000]:
        col = cudf.core.column.as_column(rng.choice(cp.arange(K, dtype="uint32"), size=N, replace=True))
        start = time.time()
        for _ in range((reps := 1_000_000_000 // N)):
            y = col.astype("category", ordered=False)
            del y
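        end = time.time()
        # Mean seconds per conversion (this reporting line is an assumption)
        print(f"K={K} N={N}: {(end - start) / reps:.3g} s")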
        del col

Across column sizes and numbers of unique values, the new code is between 25% and 30% faster.

harrism · 2023-09-21T12:25:05Z

Excellent!

wence- requested a review from a team as a code owner September 20, 2023 14:00

wence- requested review from mroeschke and galipremsagar September 20, 2023 14:00

github-actions bot added the Python (Affects Python cuDF API.) label Sep 20, 2023

galipremsagar approved these changes Sep 20, 2023

wence- added the Performance (Performance related issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Sep 20, 2023

bdice approved these changes Sep 20, 2023

shwina approved these changes Sep 20, 2023

rapids-bot bot merged commit e87d2fc into rapidsai:branch-23.10 Sep 20, 2023
58 checks passed

wence- deleted the wence/fix/categorical-mem-usage branch September 20, 2023 20:18

wence- mentioned this pull request Sep 21, 2023

[FEA] Use libcudf Dictionary type for CategoricalColumn in Python #8573

Open
