scSampler is a Python package for fast diversity-preserving subsampling of large-scale single-cell transcriptomic data.
Please install it from PyPI:
pip install scsampler
First, load the required modules.
import numpy as np
import pandas as pd
import scanpy as sc
from time import time
from scsampler import scsampler
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
The example data can be downloaded from https://doi.org/10.5281/zenodo.5811787 in the anndata format used by scanpy. Here we use the PBMC dataset of ~68,000 cells. Please change the path below to point to your own copy of the file.
adata = sc.read_h5ad('/home/dongyuan/scSampler/data/final_h5ad/pbmc68k.h5ad')
Subsample 10% of the cells and return a new anndata object. Subsampling is performed in the space of the top principal components (PCs).
adata_sub = scsampler(adata, fraction = 0.1, copy = True)
To speed things up, you can set the random_split argument. This trades a slightly less optimal result for a shorter running time.
start = time()
adata_sub = scsampler(adata, fraction = 0.1, obsm = 'X_pca', copy = True, random_split = 16)
end = time()
print(end - start)
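The idea behind random splitting can be sketched with NumPy alone. This is only an illustration, written under the assumption that random_split = k partitions the cells at random into k groups and subsamples each group separately; the actual implementation inside scSampler may differ. The names split_then_subsample and random_pick are hypothetical, and the uniform random pick is a stand-in for the real diversity-preserving step, not scSampler's algorithm.

```python
import numpy as np

def split_then_subsample(mat, fraction, n_splits, subsample_fn, seed=0):
    """Hypothetical helper: shuffle the rows, cut them into n_splits chunks,
    subsample each chunk independently, and return the selected row indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(mat.shape[0])
    selected = []
    for chunk in np.array_split(perm, n_splits):
        n_keep = max(1, int(round(fraction * len(chunk))))
        # subsample_fn returns positions within the chunk
        local = subsample_fn(mat[chunk], n_keep, rng)
        # map chunk-local positions back to row indices of the full matrix
        selected.append(chunk[local])
    return np.sort(np.concatenate(selected))

def random_pick(sub_mat, n_keep, rng):
    """Stand-in subsampler (uniform random), NOT scSampler's algorithm."""
    return rng.choice(sub_mat.shape[0], size=n_keep, replace=False)

# Toy matrix standing in for adata.obsm['X_pca']
mat = np.random.default_rng(1).normal(size=(1600, 50))
idx = split_then_subsample(mat, fraction=0.1, n_splits=16, subsample_fn=random_pick)
print(idx.shape)
```

If the cost of the per-group subsampler grows faster than linearly with the number of cells, k small problems are cheaper than one large one. Each group is subsampled without seeing the others, which is consistent with the slightly less optimal result noted above.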
You can also pass a numpy.ndarray as the input.
mat = adata.obsm['X_pca']
print(type(mat))
res = scsampler(mat, fraction = 0.1, copy = True, random_split = 16)
subsample_index = res[1]
subsample_mat = res[0]
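The returned index can be reused to subset anything that is aligned with the rows of the input matrix, such as cell-type labels. The sketch below assumes that the second element of the result holds integer row indices into the input, as the variable name above suggests. To stay runnable without the package, it uses a randomly drawn stand-in for subsample_index and made-up labels; both are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1000
mat = rng.normal(size=(n_cells, 50))                 # stand-in for adata.obsm['X_pca']
labels = rng.choice(['B', 'T', 'NK'], size=n_cells)  # hypothetical cell-type labels

# Stand-in for res[1]; in practice this would come from scsampler.
subsample_index = np.sort(rng.choice(n_cells, size=100, replace=False))

# Rows of the matrix and the matching labels for the selected cells
subsample_mat = mat[subsample_index]
subsample_labels = labels[subsample_index]

# Compare label composition before and after subsampling
full_counts = dict(zip(*np.unique(labels, return_counts=True)))
sub_counts = dict(zip(*np.unique(subsample_labels, return_counts=True)))
print(full_counts)
print(sub_counts)
```

Comparing the two count tables is a quick way to check whether rare groups are still represented after subsampling.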
Questions and suggestions about scSampler are welcome! Please report them on issues or contact Dongyuan ([email protected]).