-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create cat regressor #3353
base: main
Are you sure you want to change the base?
Create cat regressor #3353
Changes from 12 commits
086f70d
37244a9
b4ecb0a
36858d9
be1bccc
a1a59ae
7b41bc8
119a142
d77fa9c
236e356
bb9cde4
bbb5035
2a92193
c7b78c0
b001c0e
c3ce03e
c50226a
c6665f4
858e247
2421bd5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
* Speed up for a categorical regressor in {func}`~scanpy.pp.regress_out` {smaller}`S Dicks` |
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -628,6 +628,21 @@ def normalize_per_cell( | |||||||||||||||||||||
DT = TypeVar("DT") | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
@njit | ||||||||||||||||||||||
def _create_regressor_categorical( | ||||||||||||||||||||||
X: np.ndarray, number_categories: int, filters: np.ndarray | ||||||||||||||||||||||
) -> np.ndarray: | ||||||||||||||||||||||
# create regressor matrix faster for categorical variables | ||||||||||||||||||||||
regressors = np.zeros(X.shape, dtype=X.dtype) | ||||||||||||||||||||||
XT = X.T | ||||||||||||||||||||||
for category in range(number_categories): | ||||||||||||||||||||||
mask = category == filters | ||||||||||||||||||||||
for ix in numba.prange(XT.shape[0]): | ||||||||||||||||||||||
x = XT[ix] | ||||||||||||||||||||||
regressors[mask, ix] = x[mask].mean() | ||||||||||||||||||||||
flying-sheep marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||
return regressors | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
@njit | ||||||||||||||||||||||
def get_resid( | ||||||||||||||||||||||
data: np.ndarray, | ||||||||||||||||||||||
|
@@ -722,13 +737,13 @@ def regress_out( | |||||||||||||||||||||
"we regress on the mean for each category." | ||||||||||||||||||||||
) | ||||||||||||||||||||||
logg.debug("... regressing on per-gene means within categories") | ||||||||||||||||||||||
regressors = np.zeros(X.shape, dtype="float32") | ||||||||||||||||||||||
# Create numpy array's from categorical variable | ||||||||||||||||||||||
number_categories = np.int64(len(adata.obs[keys[0]].cat.categories)) | ||||||||||||||||||||||
filters = adata.obs[keys[0]].cat.codes.to_numpy() | ||||||||||||||||||||||
number_categories = number_categories.astype(filters.dtype) | ||||||||||||||||||||||
Comment on lines
+740
to
+742
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Either this or add a comment (to the code) explaining why it needs to be the other way.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added a comment. Other wise you have a dtype missmatch and crash of the kernel There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I would say that this is the important part for the comment! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 100%!
I see that you’re
So you don’t need to comment that you do any of that. I asked because I’m confused why a Python integer is converted to a numpy scalar: Usually APIs accept either and do the converting themselves. So I’d like to see a comment removing that confusion by explaining why you convert to a numpy scalar. (a crash is a great reason) but I also see that I can’t reproduce the crash. leaving the thing as a Python int just works for me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also the way to do this in one step is
Suggested change
|
||||||||||||||||||||||
|
||||||||||||||||||||||
X = _to_dense(X, order="F") if issparse(X) else X | ||||||||||||||||||||||
# TODO figure out if we should use a numba kernel for this | ||||||||||||||||||||||
for category in adata.obs[keys[0]].cat.categories: | ||||||||||||||||||||||
mask = (category == adata.obs[keys[0]]).values | ||||||||||||||||||||||
for ix, x in enumerate(X.T): | ||||||||||||||||||||||
regressors[mask, ix] = x[mask].mean() | ||||||||||||||||||||||
regressors = _create_regressor_categorical(X, number_categories, filters) | ||||||||||||||||||||||
variable_is_categorical = True | ||||||||||||||||||||||
# regress on one or several ordinal variables | ||||||||||||||||||||||
else: | ||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,8 @@ | |
from scipy.sparse import coo_matrix, csc_matrix, csr_matrix, issparse | ||
|
||
import scanpy as sc | ||
from scanpy.preprocessing._simple import _create_regressor_categorical | ||
from scanpy.preprocessing._utils import _to_dense | ||
from testing.scanpy._helpers import ( | ||
anndata_v0_8_constructor_compat, | ||
check_rep_mutation, | ||
|
@@ -327,14 +329,51 @@ def test_regress_out_constants(): | |
assert_equal(adata, adata_copy) | ||
|
||
|
||
def test_regress_out_reproducible(): | ||
adata = pbmc68k_reduced() | ||
@pytest.mark.parametrize( | ||
("keys", "expected_result_file_path"), | ||
[ | ||
(["n_counts", "percent_mito"], "regress_test_small.npy"), | ||
(["bulk_labels"], "regress_test_small_cat.npy"), | ||
], | ||
) | ||
def test_regress_out_reproducible(keys, tester_file): | ||
ilan-gold marked this conversation as resolved.
Show resolved
Hide resolved
|
||
adata = sc.datasets.pbmc68k_reduced() | ||
adata = adata.raw.to_adata()[:200, :200].copy() | ||
sc.pp.regress_out(adata, keys=["n_counts", "percent_mito"]) | ||
sc.pp.regress_out(adata, keys=keys) | ||
# This file was generated from the original implementation in version 1.10.3 | ||
# Now we compare new implementation with the old one | ||
tester = np.load(DATA_PATH / "regress_test_small.npy") | ||
np.testing.assert_allclose(adata.X, tester) | ||
# Now we compare the new implementation with the old one | ||
tester = np.load(DATA_PATH / expected_result_file_path) | ||
np.testing.assert_array_almost_equal(adata.X, tester) | ||
|
||
|
||
def _gen_org_regressors(adata, keys, X_org): | ||
# helper function to generate the original regressors | ||
regressors = np.zeros(X_org.shape, dtype=X_org.dtype) | ||
X = _to_dense(X_org, order="F") | ||
for category in adata.obs[keys[0]].cat.categories: | ||
mask = (category == adata.obs[keys[0]]).values | ||
for ix, x in enumerate(X.T): | ||
regressors[mask, ix] = x[mask].mean() | ||
return regressors | ||
|
||
|
||
def test_regressor_categorical(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can see your point here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you have an an opinion on the first point? Is this test necessary? If so, perhaps a comment then? |
||
adata = sc.datasets.pbmc68k_reduced() | ||
adata = adata.raw.to_adata()[:200, :200] | ||
X_org = adata.X.copy().astype(np.float64) | ||
keys = ["bulk_labels"] | ||
# Create org regressors | ||
regressors = _gen_org_regressors(adata, keys, X_org) | ||
|
||
# Create new regressors | ||
cats = np.int64(len(adata.obs[keys[0]].cat.categories)) | ||
filters = adata.obs[keys[0]].cat.codes.to_numpy() | ||
cats = cats.astype(filters.dtype) | ||
X = _to_dense(X_org, order="F") | ||
new_reg = _create_regressor_categorical(X, cats, filters) | ||
|
||
# Compare the two implementations | ||
np.testing.assert_allclose(new_reg, regressors) | ||
|
||
|
||
def test_regress_out_constants_equivalent(): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment mean?