different embeddings for same cell when different number of cells embedded #53

Open
dimalvovs opened this issue Nov 7, 2024 · 3 comments

Comments

@dimalvovs

Hello, thank you for your brilliant work and sorry for a lame question:

Is it expected that the embedding for the same cell is different based on what other cells it is being embedded with? If so, I assume it's unviable to embed just one cell?

Example

import scanpy as sc
import os
import numpy as np

adata = sc.read_h5ad('data/10k_pbmcs_proc.h5ad')

#10 cell subset
adata10 = adata[0:10, :]
adata10.write_h5ad('data/10_cells_pbmcs_proc.h5ad')
os.system('python eval_single_anndata.py --adata_path data/10_cells_pbmcs_proc.h5ad')
embedded10 = sc.read_h5ad('10_cells_pbmcs_proc_uce_adata.h5ad')

#20 cell subset
adata20 = adata[0:20, :]
adata20.write_h5ad('data/20_cells_pbmcs_proc.h5ad')
os.system('python eval_single_anndata.py --adata_path data/20_cells_pbmcs_proc.h5ad')
embedded20 = sc.read_h5ad('20_cells_pbmcs_proc_uce_adata.h5ad')

np.correlate(embedded10.obsm['X_uce'][0], embedded20.obsm['X_uce'][0])
array([0.7416747], dtype=float32)
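
A side note on the metric used above: with its default `mode='valid'`, `np.correlate` on two equal-length 1-D vectors returns a single dot product, not a Pearson correlation coefficient. The value is only bounded by [-1, 1] if both embeddings are unit-normalized, which this thread does not confirm. The sketch below uses made-up vectors (not real UCE embeddings) and plain-Python helpers, named here purely for illustration, to show how the three quantities can differ.

```python
import math

def dot(a, b):
    # For equal-length real 1-D inputs, this is the single value that
    # np.correlate(a, b) returns in its default 'valid' mode.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product divided by the product of the vector norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def pearson(a, b):
    # Pearson correlation is the cosine similarity of mean-centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine(centered_a, centered_b)

# Invented example vectors, for illustration only.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 2.0, 4.0, 4.0]

print("dot:    ", dot(u, v))      # 34.0, not bounded by 1
print("cosine: ", cosine(u, v))
print("pearson:", pearson(u, v))
```

With NumPy, `np.corrcoef(a, b)[0, 1]` gives the Pearson coefficient directly.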
@Yanay1
Collaborator

Yanay1 commented Nov 7, 2024

Thanks for the question!

You should expect a very small amount of variance because of the random sampling process, but the correlation should still be very high, much higher than that.

However, what you are observing is unrelated to that: we have a small preprocessing step that just filters very lowly expressed genes and cells with low expression:

if additional_filter:

Adding --filter=False should disable this; however, there seems to be a bug in how that argument is passed. In the meantime, if you modify your script to use the first 1000 and 2000 cells, you will still see differences due to filtering, but you should get a correlation of around 0.97. (If you do this, make sure to change the file names; otherwise it will just reuse the existing 10- and 20-cell processed files.)
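
To see why a filtering step can make a cell's embedding depend on which other cells are in the file, consider a gene filter of the form "keep genes detected in at least `min_cells` cells". The sketch below is a toy illustration with invented counts, a hypothetical helper `genes_kept`, and a hypothetical threshold; it is not the filter UCE applies, whose thresholds are not given in this thread.

```python
def genes_kept(counts, min_cells):
    """Return indices of genes with a nonzero count in at least min_cells cells.

    counts: list of cells, each a list of per-gene counts (toy data).
    """
    n_genes = len(counts[0])
    kept = []
    for g in range(n_genes):
        detected = sum(1 for cell in counts if cell[g] > 0)
        if detected >= min_cells:
            kept.append(g)
    return kept

# Invented 4 cells x 3 genes count matrix.
cells = [
    [5, 1, 1],
    [3, 0, 0],
    [4, 2, 0],
    [6, 1, 0],
]

small = genes_kept(cells[:2], min_cells=2)  # first two cells only
large = genes_kept(cells, min_cells=2)      # all four cells

print(small)  # [0]
print(large)  # [0, 1]
```

Cell 0 is identical in both subsets, yet its count for gene 1 survives the filter only in the larger subset, so the input derived from that cell differs between the two runs.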

@dimalvovs
Author

Great to know, I thought I missed a major breakthrough on how datasets are embedded, thank you!👍

@dimalvovs
Author

just to confirm - I got

>>> np.correlate(embedded10.obsm['X_uce'][0], embedded20.obsm['X_uce'][0])
array([0.946164], dtype=float32)

after disabling the filter.

The reason it does not work is that with type=bool, argparse calls bool() on the raw string, so "True", "False", and any other non-empty string that is passed all evaluate to True.
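
A minimal reproduction of this pitfall, using only the standard library. The parser here is a stand-alone sketch that mirrors the `--filter` argument quoted in this thread, not the full argument list of `eval_single_anndata.py`.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--filter', type=bool, default=True,
                    help='Additional gene/cell filtering on the anndata.')

# argparse passes the raw string to bool(); any non-empty string is truthy.
args_false = parser.parse_args(['--filter', 'False'])
args_zero = parser.parse_args(['--filter', '0'])

print(args_false.filter)  # True
print(args_zero.filter)   # True

# Only the empty string converts to False.
print(bool('False'))  # True
print(bool(''))       # False
```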

An easy fix is changing

    parser.add_argument('--filter', type=bool, default=True,
                        help='Additional gene/cell filtering on the anndata.')

to

    parser.add_argument('--filter', default=True, action=argparse.BooleanOptionalAction,
                        help='Additional gene/cell filtering on the anndata.')

which keeps filtering on by default and also exposes an additional --no-filter option to turn it off if needed.

The same applies to all type=bool parameters. This works on Python 3.9 and above, where argparse.BooleanOptionalAction was introduced.
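
A short sketch of how the proposed argument behaves, again as a stand-alone parser rather than the project's actual script. It requires Python 3.9 or later.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--filter', default=True,
                    action=argparse.BooleanOptionalAction,
                    help='Additional gene/cell filtering on the anndata.')

default_args = parser.parse_args([])           # no flag: default applies
on_args = parser.parse_args(['--filter'])      # explicit on
off_args = parser.parse_args(['--no-filter'])  # explicit off

print(default_args.filter)  # True
print(on_args.filter)       # True
print(off_args.filter)      # False
```

One design consequence: the flag no longer takes a value, so callers switch from `--filter=False` to `--no-filter`.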
