different embeddings for same cell when different number of cells embedded #53

dimalvovs · 2024-11-07T16:08:21Z

Hello, thank you for your brilliant work and sorry for a lame question:

Is it expected that the embedding for the same cell is different based on what other cells it is being embedded with? If so, I assume it's unviable to embed just one cell?

Example

import scanpy as sc
import os
import numpy as np

adata = sc.read_h5ad('data/10k_pbmcs_proc.h5ad')

#10 cell subset
adata10 = adata[0:10, :]
adata10.write_h5ad('data/10_cells_pbmcs_proc.h5ad')
os.system('python eval_single_anndata.py --adata_path data/10_cells_pbmcs_proc.h5ad')
embedded10 = sc.read_h5ad('10_cells_pbmcs_proc_uce_adata.h5ad')

#20 cell subset
adata20 = adata[0:20, :]
adata20.write_h5ad('data/20_cells_pbmcs_proc.h5ad')
os.system('python eval_single_anndata.py --adata_path data/20_cells_pbmcs_proc.h5ad')
embedded20 = sc.read_h5ad('20_cells_pbmcs_proc_uce_adata.h5ad')

np.correlate(embedded10.obsm['X_uce'][0], embedded20.obsm['X_uce'][0])
array([0.7416747], dtype=float32)
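
A note on the metric used above: for two 1-D arrays of equal length, np.correlate in its default 'valid' mode returns a single value, the dot product of the two vectors, not a Pearson correlation coefficient. If the embeddings happen to be unit-normalized (an assumption, not confirmed in this thread), that value coincides with cosine similarity. A minimal sketch, using made-up vectors in place of real embeddings:

```python
import numpy as np

# Made-up vectors standing in for two embeddings of the same cell.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Default mode='valid' with equal-length inputs yields one number: the dot product.
dot_via_correlate = np.correlate(a, b)

# Cosine similarity: dot product of the vectors after L2 normalization.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson correlation: cosine similarity of the mean-centered vectors.
pearson = float(np.corrcoef(a, b)[0, 1])

print(dot_via_correlate, cosine, pearson)
```

For these toy vectors the three numbers differ, so it is worth stating which one is being reported when comparing embeddings.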


Yanay1 · 2024-11-07T18:21:41Z

Thanks for the question!

You should expect a very small amount of variance because of the random sampling process, but the correlation should still be very high, much higher than that.

However, what you are observing is unrelated to that: we have a small preprocessing step that just filters very lowly expressed genes and cells with low expression:

UCE/data_proc/data_utils.py, line 206 in cba0d59:

    if additional_filter:

Adding --filter=False should disable this; however, there seems to be a bug in how that argument is passed. In the meantime, if you modify your script to use the first 1000 and 2000 cells, filtering still causes some differences, but you should get a correlation around 0.97. (If you do this, make sure to change the file names; otherwise it will just reuse the existing 10- and 20-cell processed files.)
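
To see why a count-based filter makes the result depend on which cells are in the file, consider a gene filter that keeps only genes detected in at least some minimum number of cells. The sketch below is a toy illustration: the matrix, the `min_cells` threshold, and the helper `kept_genes` are hypothetical and are not UCE's actual filtering code or thresholds.

```python
import numpy as np

def kept_genes(counts, min_cells):
    """Return indices of genes (columns) detected in at least `min_cells` cells (rows)."""
    cells_per_gene = (counts > 0).sum(axis=0)
    return np.flatnonzero(cells_per_gene >= min_cells)

# Toy counts: 4 cells x 3 genes. Gene 2 is only detected in the last two cells.
counts = np.array([
    [5, 1, 0],
    [3, 0, 0],
    [4, 2, 7],
    [6, 1, 9],
])

genes_first2 = kept_genes(counts[:2], min_cells=2)  # file with the first 2 cells
genes_all4 = kept_genes(counts, min_cells=2)        # file with all 4 cells

print(genes_first2.tolist(), genes_all4.tolist())
```

The first cell is identical in both files, but it is described by a different set of genes after filtering, so a model would receive different input for it in the two runs.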

dimalvovs · 2024-11-07T22:08:16Z

Great to know, I thought I missed a major breakthrough on how datasets are embedded, thank you!👍

dimalvovs · 2024-11-08T21:47:49Z

Just to confirm, I got

>>> np.correlate(embedded10.obsm['X_uce'][0], embedded20.obsm['X_uce'][0])
array([0.946164], dtype=float32)

after disabling the filter.

The reason --filter=False does not work is that with type=bool, argparse calls bool() on the raw string, and any non-empty string, including "False", evaluates to True.
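
The pitfall can be reproduced in isolation with a throwaway parser. This is a minimal sketch that mirrors the flag definition only; it is not the UCE script itself.

```python
import argparse

parser = argparse.ArgumentParser()
# Same pattern as the existing flag: argparse applies bool() to the raw string.
parser.add_argument('--filter', type=bool, default=True)

# Every non-empty string is truthy, so none of these switch the filter off.
parsed_false = parser.parse_args(['--filter=False']).filter
parsed_zero = parser.parse_args(['--filter=0']).filter
parsed_no = parser.parse_args(['--filter', 'no']).filter

print(parsed_false, parsed_zero, parsed_no)
```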

An easy fix is changing

    parser.add_argument('--filter', type=bool, default=True,
                        help='Additional gene/cell filtering on the anndata.')

to

    parser.add_argument('--filter', default=True, action=argparse.BooleanOptionalAction,
                        help='Additional gene/cell filtering on the anndata.')

which keeps filtering on by default and exposes an additional --no-filter option to turn it off if needed.

The same applies to all type=bool parameters. Note that argparse.BooleanOptionalAction requires Python 3.9 or above.
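
A quick check of how the proposed definition behaves, again on a standalone parser rather than the real script:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--filter', default=True, action=argparse.BooleanOptionalAction,
                    help='Additional gene/cell filtering on the anndata.')

default_value = parser.parse_args([]).filter              # flag omitted: default applies
explicit_on = parser.parse_args(['--filter']).filter      # filtering explicitly enabled
explicit_off = parser.parse_args(['--no-filter']).filter  # filtering disabled

print(default_value, explicit_on, explicit_off)
```

With this change, the command from the original example would be run with --no-filter appended instead of --filter=False.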
