Create cat regressor #3353

Intron7 · 2024-11-11T14:06:20Z

Use numba to create the regressor for categorical regression

codecov · 2024-11-11T14:21:13Z

Codecov Report

Attention: Patch coverage is 53.84615% with 6 lines in your changes missing coverage. Please review.

Project coverage is 76.46%. Comparing base (6dd0a7a) to head (2421bd5).
Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
src/scanpy/preprocessing/_simple.py	53.84%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3353      +/-   ##
==========================================
- Coverage   76.58%   76.46%   -0.12%     
==========================================
  Files         111      111              
  Lines       12862    12874      +12     
==========================================
- Hits         9850     9844       -6     
- Misses       3012     3030      +18

Files with missing lines	Coverage Δ
src/scanpy/preprocessing/_simple.py	`88.46% <53.84%> (-1.46%)`	⬇️

... and 7 files with indirect coverage changes


ilan-gold · 2024-11-11T15:13:59Z

tests/test_preprocessing.py

+    np.testing.assert_array_almost_equal(adata.X, tester)
+
+
+def test_regressor_categorical():


I would

explain why this test exists (to test against a previous implementation?). TBH I have no strong opinion on whether it's necessary, since we are already testing for reproducibility; I could see getting rid of it.

refactor the "Create org regressors" into a helper function like create_original

I can see your point here

Do you have an opinion on the first point? Is this test necessary? If so, perhaps add a comment?

tests/test_preprocessing.py

ilan-gold

I think this is missing: #3353 (comment) and the first part of https://github.com/scverse/scanpy/pull/3353/files#r1836830351


ilan-gold · 2024-11-12T13:32:38Z

src/scanpy/preprocessing/_simple.py

@@ -722,13 +737,13 @@ def regress_out(
                "we regress on the mean for each category."
            )
        logg.debug("... regressing on per-gene means within categories")
-        regressors = np.zeros(X.shape, dtype="float32")
+        # Create numpy array's from categorical variable
+        cats = np.int64(len(adata.obs[keys[0]].cat.categories))


Also comment why np.int64

It has to be done because of weird typing from pandas. This ensures that it works within the kernel.

so len doesn’t return a Python int? That’s a pandas bug.
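
The question above can be checked directly. A minimal sketch, assuming a small pandas categorical Series as a stand-in for `adata.obs[keys[0]]` (the variable names are illustrative, not from the PR): `len()` yields a built-in `int`, while the codes array carries a narrow integer dtype chosen by pandas, which is the likely source of any dtype difference.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs[keys[0]]
obs_column = pd.Series(["a", "b", "a", "c"], dtype="category")

n_categories = len(obs_column.cat.categories)
codes = obs_column.cat.codes.to_numpy()

print(type(n_categories))  # built-in int: len() always returns one
print(codes.dtype)         # small integer dtype pandas picked for the codes
```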


ilan-gold · 2024-11-12T15:53:37Z

tests/test_preprocessing.py

+    np.testing.assert_array_almost_equal(adata.X, tester)
+
+
+def test_regressor_categorical():


Do you have an opinion on the first point? Is this test necessary? If so, perhaps add a comment?

src/scanpy/preprocessing/_simple.py

flying-sheep · 2024-11-14T08:41:57Z

src/scanpy/preprocessing/_simple.py

+        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = number_categories.astype(filters.dtype)


Either this or add a comment (to the code) explaining why it needs to be the other way.
Also if I do this, the test still passes, so …

Suggested change

-        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
-        filters = adata.obs[keys[0]].cat.codes.to_numpy()
-        number_categories = number_categories.astype(filters.dtype)
+        number_categories = len(adata.obs[keys[0]].cat.categories)
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()

I added a comment. Otherwise you get a dtype mismatch and the kernel crashes.

Otherwise you get a dtype mismatch and the kernel crashes.

I would say that this is the important part for the comment!

100%!

refactor your code until the “what” is obvious.

if the “why” isn’t obvious from understanding the “what”, add the missing parts as a comment

I see that you’re

converting the cat codes into a numpy array

creating a numpy scalar with the same dtype as filters, holding the number of categories

So you don’t need to comment that you do any of that.

I asked because I’m confused why a Python integer is converted to a numpy scalar: Usually APIs accept either and do the converting themselves. So I’d like to see a comment removing that confusion by explaining why you convert to a numpy scalar. (a crash is a great reason)

but I also see that _create_regressor_categorical has number_categories: int and then does range(number_categories), so I’m still very confused why numba crashes unless the dtypes match.

I can’t reproduce the crash. leaving the thing as a Python int just works for me.

Also the way to do this in one step is

Suggested change

-        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
-        filters = adata.obs[keys[0]].cat.codes.to_numpy()
-        number_categories = number_categories.astype(filters.dtype)
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = filters.dtype.type(len(adata.obs[keys[0]].cat.categories))
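
To make the one-step suggestion concrete, here is a small sketch of what `filters.dtype.type(...)` produces. It assumes a toy categorical Series named `batch` in place of `adata.obs[keys[0]]`; that name is hypothetical. Whether the matching dtype is actually required by the numba kernel is exactly what this thread leaves open.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs[keys[0]]
batch = pd.Series(["a", "b", "a", "c"], dtype="category")

filters = batch.cat.codes.to_numpy()
# dtype.type is the scalar class of the array's dtype, so calling it
# builds a numpy scalar whose dtype matches the codes array
number_categories = filters.dtype.type(len(batch.cat.categories))

print(filters.dtype, number_categories.dtype)
```

The scalar and the array then share one dtype without first going through `np.int64` and `astype`.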


ilan-gold · 2024-11-21T10:58:01Z

src/scanpy/preprocessing/_simple.py

+        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = number_categories.astype(filters.dtype)


Otherwise you get a dtype mismatch and the kernel crashes.

I would say that this is the important part for the comment!

ilan-gold · 2024-11-21T10:58:13Z

src/scanpy/preprocessing/_simple.py

+def _create_regressor_categorical(
+    X: np.ndarray, number_categories: int, filters: np.ndarray
+) -> np.ndarray:
+    # create regressor matrix faster for categorical variables


What does this comment mean?
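
For context on what such a regressor matrix contains, here is a minimal pure-NumPy sketch. It assumes, based on the debug message in `regress_out` ("regressing on per-gene means within categories"), that each cell's row is the per-gene mean of its category. This is not the numba kernel from the PR, and the function name is hypothetical; only the argument names follow the signature quoted above.

```python
import numpy as np


def create_regressor_categorical_sketch(
    X: np.ndarray, number_categories: int, filters: np.ndarray
) -> np.ndarray:
    """Hypothetical NumPy stand-in, not the numba kernel from this PR."""
    regressors = np.zeros(X.shape, dtype=np.float32)
    for category in range(number_categories):
        mask = filters == category
        if mask.any():
            # every cell in this category gets the category's per-gene mean
            regressors[mask] = X[mask].mean(axis=0)
    return regressors


# three cells, two genes; the first two cells share category 0
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]], dtype=np.float32)
codes = np.array([0, 0, 1], dtype=np.int8)
regressors = create_regressor_categorical_sketch(X, 2, codes)
print(regressors)
```

In this sketch a plain Python `int` works for `number_categories`, since it is only passed to `range`.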

scverse-benchmark · 2024-11-21T11:39:21Z

Benchmark changes

Change	Before [`6dd0a7a`]	After [`2421bd5`]	Ratio	Benchmark (Parameter)
+	366M	405M	1.11	preprocessing_counts.peakmem_scrublet('pbmc68k_reduced', 'counts')
-	1.39±0.03ms	1.24±0.02ms	0.9	preprocessing_log.FastSuite.time_mean_var('pbmc3k', 'off-axis')
-	584M	480M	0.82	preprocessing_log.peakmem_pca('pbmc3k', 'off-axis')

Comparison: https://github.com/scverse/scanpy/compare/6dd0a7a72c7f8f57a082cca0f6a369dc47937b04..2421bd55496036151b73c46c5ec7ffa7e5ef71eb
Last changed: Thu, 21 Nov 2024 11:39:20 +0000

More details: https://github.com/scverse/scanpy/pull/3353/checks?check_run_id=33316268173

Intron7 added 3 commits November 11, 2024 14:35

add function and test

086f70d

add test

37244a9

add test for regressor

b4ecb0a

Intron7 added this to the 1.11.0 milestone Nov 11, 2024

Intron7 and others added 2 commits November 11, 2024 15:54

add release note

36858d9

Merge branch 'main' into create_cat_regressor

be1bccc

Intron7 requested review from flying-sheep and ilan-gold November 11, 2024 14:56

ilan-gold requested changes Nov 11, 2024

Intron7 added 2 commits November 11, 2024 16:25

update typing

a1a59ae

update test

7b41bc8

Intron7 requested a review from ilan-gold November 11, 2024 15:36

ilan-gold requested changes Nov 12, 2024

Intron7 added 2 commits November 12, 2024 13:45

update test

119a142

update dtype

d77fa9c

ilan-gold requested changes Nov 12, 2024

Intron7 and others added 4 commits November 12, 2024 14:44

rename cats

236e356

Update tests/test_preprocessing.py

bb9cde4

Co-authored-by: Ilan Gold

Update tests/test_preprocessing.py

bbb5035

Co-authored-by: Ilan Gold

Update tests/test_preprocessing.py

2a92193

Co-authored-by: Ilan Gold

Intron7 requested a review from ilan-gold November 12, 2024 15:18

ilan-gold requested changes Nov 12, 2024

ilan-gold and others added 4 commits November 12, 2024 16:53

Update tests/test_preprocessing.py

c7b78c0

remove test

b001c0e

update kernel

c3ce03e

remove test

c50226a

Intron7 requested a review from ilan-gold November 13, 2024 10:55

flying-sheep requested changes Nov 14, 2024

make test together

c6665f4

Intron7 added 2 commits November 21, 2024 11:43

cleanup

858e247

add comment

2421bd5

Intron7 requested a review from flying-sheep November 21, 2024 10:45

ilan-gold reviewed Nov 21, 2024

ilan-gold added the benchmark label Nov 21, 2024

flying-sheep removed their request for review November 21, 2024 11:39
