A unified approach for integrating spatial and single-cell transcriptomics data by leveraging deep generative models
Visit our documentation for installation, tutorials, examples and more.
$ git clone https://github.com/YangLabHKUST/SpatialScope.git
$ cd SpatialScope
$ conda env create -f environment.yml
$ conda activate SpatialScope
# Fix a bug in squidpy by overwriting one of its files; if your environment lives elsewhere, use `which python` to locate its site-packages directory
$ rsync ./src/_feature_mixin.py ~/.conda/envs/SpatialScope/lib/python3.9/site-packages/squidpy/im/_feature_mixin.py
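The destination path above assumes a conda environment under ~/.conda/envs and Python 3.9. As a hedged alternative, the sketch below asks Python where a package is installed instead of hard-coding the path. `package_file` is a hypothetical helper, not part of SpatialScope, and it is demonstrated on the standard-library `json` package so that it runs even without squidpy installed.

```python
import importlib.util
from pathlib import Path


def package_file(package: str, relative_path: str) -> Path:
    """Return the path of `relative_path` inside an installed package.

    Hypothetical helper (not part of SpatialScope): it only resolves the
    location, it does not copy or modify anything.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package!r} is not an importable package")
    # For a regular package the first search location is its directory.
    return Path(list(spec.submodule_search_locations)[0]) / relative_path


# Demonstrated on a stdlib package; inside the SpatialScope environment one
# would call package_file("squidpy", "im/_feature_mixin.py") instead.
print(package_file("json", "decoder.py"))
```

The printed path can then be used as the destination argument of the rsync command.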
Check the installation status:
$ python ./src/Cell_Type_Identification.py -h
If the installation fails, consider using Docker instead. Make sure Docker and nvidia-container-toolkit are installed first, then pull the SpatialScope image from Docker Hub.
$ docker pull xiaojs95/spatialscope
$ docker images
Usage
$ docker run -it --gpus all --ipc=host xiaojs95/spatialscope /bin/bash
Update the repository if necessary:
$ git pull
Check the installation status:
$ python ./src/Cell_Type_Identification.py -h
We provide source code for reproducing the SpatialScope analyses in the main text in the demos directory.
All relevant materials used in the reproduction code are available from here
- Benchmarking Dataset 1
- Benchmarking Dataset 2
- Benchmarking Dataset 3
- Benchmarking Dataset 4
- Benchmarking Dataset 5
- Benchmarking Dataset 6
- Human Heart (Visium, a single slice)
- Mouse Brain (Visium, 3D alignment of multiple slices)
- Mouse Cerebellum (Slideseq-V2)
- Mouse MOp (MERFISH)
We illustrate the usage of SpatialScope using a single slice of 10x Visium human heart data:
- Spatial data: ./demo_data/V1_Human_Heart_spatial.h5ad
- Image data: ./demo_data/V1_Human_Heart_image.tif
- scRNA reference data: ./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad
- Pretrained model checkpoint: ./Ckpts_scRefs/Heart_D2/model_5000.pt (only required for Step3)
All relevant materials used in the following example are available from here
python ./src/Nuclei_Segmentation.py --tissue heart --out_dir ./output --ST_Data ./demo_data/V1_Human_Heart_spatial.h5ad --Img_Data ./demo_data/V1_Human_Heart_image.tif
Input:
- --out_dir: output directory
- --tissue: output sub-directory
- --ST_Data: ST data file path
- --Img_Data: H&E stained image data file path (requires a raw, high-resolution H&E image, roughly 10000x10000 pixels and about 500 MB in size)
This step takes about 5 minutes, creates the ./output/heart directory, and generates two files:
- Visualization of nuclei segmentation results: nuclei_segmentation.png
- Preprocessed ST data for cell type identification: sp_adata_ns.h5ad (cell_locations, which contains the spatial locations of the segmented cells, is added to .uns)
python ./src/Cell_Type_Identification.py --tissue heart --out_dir ./output --ST_Data ./output/heart/sp_adata_ns.h5ad --SC_Data ./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad --cell_class_column cell_type
Input:
- --out_dir: output directory
- --tissue: output sub-directory
- --ST_Data: ST data file path (generated in Step 1)
- --SC_Data: single-cell reference data file path (when using your own scRef file, we recommend adding a Marker column to .var to pre-select several thousand marker or highly variable genes, as in "./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad")
- --cell_class_column: cell class label column in scRef file
This step will take about 10 mins and generate three files:
- Visualization of cell type identification results: estemated_ct_label.png
- Cell type identification results: CellTypeLabel_nu10.csv
- Preprocessed ST data for gene expression decomposition: sp_adata.h5ad
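For a quick summary of the identified cell types, a label column of the results CSV can be tallied with the standard library. This is a minimal sketch: `count_labels` is a hypothetical helper, and the column name passed to it is an assumption, since this README does not list the columns of CellTypeLabel_nu10.csv (`discrete_label_ct` is borrowed from the plotting call in this README). Check the header of your file first.

```python
import csv
from collections import Counter


def count_labels(csv_path: str, column: str) -> Counter:
    """Count how many rows carry each value of `column` in a CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {column!r} not found; available: {reader.fieldnames}"
            )
        # Counter consumes the rows while the file is still open.
        return Counter(row[column] for row in reader)


# Example call (both the path and the column name are assumptions):
# counts = count_labels("./output/heart/CellTypeLabel_nu10.csv", "discrete_label_ct")
# for label, n in counts.most_common():
#     print(label, n)
```

Raising a KeyError that lists the available columns makes a wrong column guess easy to correct.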
Now we can use sp_adata.h5ad to visualize the single-cell resolution spatial distribution of different cell types:
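import scanpy as sc
import matplotlib.pyplot as plt
# PlotVisiumCells is a SpatialScope plotting helper that is not defined here; see the demo notebook for its import
ad_sp = sc.read('./output/heart/sp_adata.h5ad')
fig, ax = plt.subplots(1, 1, figsize=(12, 8), dpi=100)
PlotVisiumCells(ad_sp, "discrete_label_ct", size=0.3, alpha_img=0.3, lw=0.8, ax=ax)
More details are available in the Jupyter notebook Human Heart (Visium, a single slice).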
In Step3, by conditioning on the inferred cell type labels from Step2, SpatialScope performs gene expression decomposition, transforming the spot-level gene expression profile into single-cell resolution. To do this, we first learn a score-based generative model to approximate the expression distribution of different cell types from the single-cell reference data. Then we use the learned model to decompose gene expression from the spot level to the single-cell level, while accounting for the batch effect between single-cell reference and ST data.
python ./src/Decomposition.py --tissue heart --out_dir ./output --SC_Data ./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad --cell_class_column cell_type --ckpt_path ./Ckpts_scRefs/Heart_D2/model_5000.pt --spot_range 0,100 --gpu 0,1,2,3
Input:
- --out_dir: output directory
- --tissue: output sub-directory
- --SC_Data: single-cell reference data file path
- --cell_class_column: cell class label column in scRef file
- --ckpt_path: model checkpoint file path (the checkpoint was trained on the scRef file, so the checkpoint and the scRef file must match)
- --spot_range: range of spots to decompose, given as start,end (e.g., 0,1000 means from spot 0 to the 1000th spot). Limited by GPU memory, at most about 1000 spots can be handled at a time on 4 GPUs
- --gpu: Visible GPUs
This step will take about 10 mins and generate one file:
- Single-cell resolution ST data generated by SpatialScope for spot 0-100: generated_cells_spot0-100.h5ad
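Because only a limited number of spots fits on the GPUs at once, a full slice has to be decomposed in several runs with different --spot_range values. The sketch below builds one command line per batch; it only produces strings and does not launch anything. The helper names are hypothetical, the spot count of 2,500 is made up for illustration, and it assumes that consecutive ranges may share a boundary (0,1000 followed by 1000,2000), which this README does not confirm.

```python
import shlex


def spot_ranges(n_spots: int, batch_size: int = 1000):
    """Yield (start, end) pairs covering 0..n_spots in steps of batch_size."""
    if n_spots < 0 or batch_size <= 0:
        raise ValueError("n_spots must be >= 0 and batch_size must be > 0")
    for start in range(0, n_spots, batch_size):
        yield start, min(start + batch_size, n_spots)


def decomposition_commands(n_spots: int, batch_size: int = 1000, gpus: str = "0,1,2,3"):
    """Build one Decomposition.py command line per batch (strings only)."""
    base = [
        "python", "./src/Decomposition.py",
        "--tissue", "heart", "--out_dir", "./output",
        "--SC_Data", "./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad",
        "--cell_class_column", "cell_type",
        "--ckpt_path", "./Ckpts_scRefs/Heart_D2/model_5000.pt",
        "--gpu", gpus,
    ]
    return [
        shlex.join(base + ["--spot_range", f"{start},{end}"])
        for start, end in spot_ranges(n_spots, batch_size)
    ]


# 2,500 is a made-up spot count; replace it with the number of spots in your data.
for command in decomposition_commands(2500):
    print(command)
```

If the end index turns out to be inclusive in your version of Decomposition.py, adjust `spot_ranges` so that boundary spots are not processed twice.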
The scRNA-seq reference ./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad was preprocessed following standard procedures; more details are available in the Jupyter notebook Human Heart (Visium, a single slice). To make the distribution learning process more efficient, we only learned the gene expression distributions of 2,000 selected highly variable genes. In addition, we subsampled the cells of each cell type to a maximum of 3,000.
We use four RTX 2080 Ti GPUs in parallel to train the model on the scRNA-seq reference.
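The per-cell-type cap described above can be illustrated with a small standard-library sketch. It is not the authors' preprocessing code: `cap_cells_per_type` is a hypothetical helper that takes a list of cell type labels and returns the indices to keep, sampling at random (with a fixed seed) only for types that exceed the cap.

```python
import random
from collections import defaultdict


def cap_cells_per_type(labels, max_per_type=3000, seed=0):
    """Return sorted indices keeping at most `max_per_type` cells per label."""
    by_type = defaultdict(list)
    for index, label in enumerate(labels):
        by_type[label].append(index)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for indices in by_type.values():
        if len(indices) > max_per_type:
            indices = rng.sample(indices, max_per_type)
        kept.extend(indices)
    return sorted(kept)


# Toy labels: 5 cells of type "A" and 2 of type "B", capped at 3 per type.
toy = ["A", "A", "B", "A", "A", "B", "A"]
print(cap_cells_per_type(toy, max_per_type=3))
```

With an AnnData reference, the returned indices could be used to subset the object (for example `adata[kept]`), with the labels taken from the cell class column in `.obs`.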
python ./src/Train_scRef.py \
--ckpt_path ./Ckpts_scRefs/Heart_D2 \
--scRef ./Ckpts_scRefs/Heart_D2/Ref_Heart_sanger_D2.h5ad \
--cell_class_column cell_type \
--gpus 0,1,2,3 \
--sigma_begin 50 --sigma_end 0.002 --step_lr 3e-7
The checkpoints and sampled pseudo-cells will be saved in ./Ckpts_scRefs/Heart_D2, e.g., model_5000.pt and model_5000.h5ad. The pre-trained checkpoint can be used for any spatial data from the same tissue.
Due to the low sequencing depth (~2000 UMIs per cell) of this Human Heart scRNA-seq reference, we changed the default parameters of sigma_begin, sigma_end and step_lr.
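To give a feel for what these three parameters typically control, the sketch below follows the common convention of noise-conditional score networks with annealed Langevin dynamics: noise levels are spaced geometrically between sigma_begin and sigma_end, and the step size at each level is step_lr scaled by the squared ratio of that level to the smallest one. Whether Train_scRef.py uses the parameters in exactly this way is an assumption, and the number of noise levels used here is hypothetical.

```python
import math


def geometric_sigmas(sigma_begin: float, sigma_end: float, num_levels: int):
    """Noise levels spaced geometrically from sigma_begin down to sigma_end."""
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    ratio = (sigma_end / sigma_begin) ** (1.0 / (num_levels - 1))
    return [sigma_begin * ratio ** i for i in range(num_levels)]


def langevin_step_sizes(sigmas, step_lr: float):
    """Per-level step sizes: step_lr * (sigma_i / sigma_last) ** 2."""
    sigma_last = sigmas[-1]
    return [step_lr * (sigma / sigma_last) ** 2 for sigma in sigmas]


# Values from the training command; num_levels=10 is a hypothetical choice.
sigmas = geometric_sigmas(50.0, 0.002, num_levels=10)
steps = langevin_step_sizes(sigmas, step_lr=3e-7)
for sigma, step in zip(sigmas, steps):
    print(f"sigma={sigma:.4g}  step={step:.4g}")
```

Under this convention, a smaller sigma_end lets the final noise level approach the scale of the low-count data, and step_lr sets the step size at that final level.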
Because sampling from diffusion/score-based models requires hundreds to thousands of network evaluations to emulate a continuous process, the entire training process takes approximately 40 hours on four RTX 2080 Ti GPUs. We are therefore trying to accelerate training with recent techniques from the diffusion model field, such as Stable Diffusion.
For convenience, we provide the pre-trained checkpoint (Ckpts_scRefs/Heart_D2/model_5000.pt) here, so you can skip this part.
- I have access to a 3090 or, alternatively, 2x V100-SXM2. Will that work for imputing onto a 200,000-cell MERFISH dataset?
Answer: The minimum GPU requirement for SpatialScope is a 2080 Ti. However, because GPU memory is limited, we recommend imputing 1000 cells at a time; more details are available in the demo notebook Mouse MOp (MERFISH).
Please contact Xiaomeng Wan ([email protected]), Jiashun Xiao ([email protected]) or Prof. Can Yang ([email protected]) with any enquiries.