Processing workflow for covid19 single cell data
- Publication: Single-cell immunology of SARS-CoV-2 infection
- Explore and download data: Atlas
This project uses renv for package management.
To activate, simply open an R session in the top-level directory of the repo. This will prompt the
installation of the correct version of the renv
package. Then call renv::restore()
to install all packages to a project-specific library.
If you wish to disable renv for this project, simply remove or comment out the following line from
the project .Rprofile
:
source("renv/activate.R")
Scripts are designed to be run from the project directory. This will enable renv
as well as load environment variables from .Renviron
.
Paths to data directories are set in the project-local .Renviron
. All R scripts set paths relative to paths supplied in .Renviron
. If you wish to run any scripts on your machine, simply edit the .Renviron
file to point to the appropriate directories. There are also several bash scripts which wrap R scripts. Set paths manually in these scripts if you wish to run them. .Renviron
variables are used as follows:
DATA_DIR
-- All objects are saved and loaded in locations relative to this directory. It follows this format:
DATA_DIR
├── data/
│ └── __raw data for all datasets__
├── h5seurat_reprocessed/
│ └── __output h5seurat files__
├── metadata/
│ └── __metadata is written here for intermediate standardization__
├── plots
│ └── __process.R writes some diagnostic plots here__
└── seu_reprocessed/
└── __Preprocessed and processed Seurat objects are written here as .rds files__
-
REFERENCE_DATASET
-- Path to a local version of the reference dataset from Hao et al available for download here. -
DEBUG_DIR
-- Path to a directory for saving intermediate versions of the data for debugging. Utilized byprocess.R
.
The outline for processing all datasets are as follows:
- Pull datasets (sources noted below) and convert into Seurat objects using scripts in the
preprocess/
directory. run_all.sh
: Process counts data for all datasets. Perform processing steps on assay data, including normalization, dimension reduction, cluster identification, cell type annotation, and reference mapping. More details below.- Add cell type annotations to whole blood datasets using singleR. This is used when creating the combined dataset, as non-PBMC cells are removed.
run_update_metadata.sh
: Standardize metadata and update processed objects. More details below and in themetadata/
directory.run_merge.sh
: Merge PBMC and whole blood datasets into a combined dataset. More details below and in themerge/
directory.run_differential_expression_all.sh
: Identify differentially expressed marker genes for all cell types. Results are saved as csv files and merged into themisc
slot in the h5seurat objects.
First we downloaded data from public sources. Links for each source dataset are included below. Source data was prepared for processing by loading into a Seurat object and saving the result as an rds file. Scripts for preprocessing each dataset are provided in the preprocess/
directory.
21 single cell datasets were processed using the Seurat package. Low-quality cells were filtered using basic QC thresholds, normalization was performed using SCtransform, a PCA was performed, clusters were identified using the leiden algorithm, and a UMAP was calculated for visualization. The results were saved to the final Seurat object.
All datasets were mapped to a reference PBMC dataset, using Seurat V4 methods and the PBMC reference supplied with Seurat V4. The output from this analysis is saved with the processed dataset, and includes the reference UMAP as ref.umap
, predicted protein expression values as the assay predicted_ADT
, and cell type annotations.
Three datasets also included protein expression data. On these datasets, we also normalized protein expression data, ran a PCA on the result, and identified clusters from the PCA. Furthermore, a weighted nearest neighbor graph was calculated to integrate protein and gene expression data using Seurat V4, and clusters and a UMAP were calculated from the result.
process.R
runs basic QC and processing steps on a Seurat object. process.sh
and run.sh
are wrappers which will set off a slurm job to process a dataset. run_all.sh
will set off slurm jobs for all datasets, and can be used as a reference for which parameters were used for different datasets.
For each dataset, once process.R
has completed and the processed version is saved, metadata is extracted and standardized, with standard fields merged back into the original h5seurat file. For more details about how the metadata scripts work, see the README.md file in the metadata directory. The bash run_update_metadata.sh
will run all the metadata standardization scripts in order, and can be used as a reference for how metadata standardization was done.
Standard disease_severity
scores and days_since_symptom_onset
were assigned to each COVID-19 patient. For patients were either of these fields were missing, they were estimated as follows:
The standardized severity score is based as closely as possible on this model, based on information provided in datasets and supplementary materials:
mild
= not hospitalizedmoderate
= hospitalized (no ICU)severe
= ICU
Where provided, this is the number of days between symptom onset and the sampling time. It was estimated as follows for patients where it was not provided:
-
If days since hospitalization is known, 7 days are added to that number. This is based on the median difference between symptom onset and hospitalization for the 14 patients where both dates were reported.
-
For two datasets where neither days since hospitalization nor days since symptom onset are known for one or more COVID-19 patients:
-
su_2020:
T1
(sample near the time of diagnosis) was assigned7
, with the assumption that diagnosis and hospitalization would happen on a similar timeline.T2
(approximately one week later) was assigned14
. -
wen_2020: Early recovery stage (ERS) patients were assigned
15
. Late recovery stage (LRS) patients were assigned29
. These numbers are rough estimates, based on the CDC guidelines for healthcare workers (https://www.cdc.gov/coronavirus/2019-ncov/hcp/duration-isolation.html).
1. The early-recovery stage (ERS) of COVID-19 patients are defined as the date of nucleic acid turning negative to blood sampling is less than seven days.
2. The late-recovery stage (LRS) of COVID-19 patients are defined as the date of nucleic acid turning negative to blood sampling is more than fourteen days. -
yao_2021: Samples from the same severity group were pooled, and due to an issue with hashing, while severity group each for each sample is known, they cannot be mapped to specific patients. Thus, we used the median days since symptom onset and hospitalization for each severity group.
-
bost_2021: Authors reported the time from symptom onset to admission, but not the time until sampling. We assumed that samples were taken on the same day as admission, so used the median of the number of days between symptom onset and admission for each severity group, reported in Table 2.
-
stephenson_2021 and combes_2021: Both datasets were missing information about days since symptom onset for several patients. For those missing records, we used the median from the corresponding severity group where those numbers were not missing.
As only cell types in the Seurat v4 multimodal CITE-seq PBMC reference set can be correctly mapped, for visualization purposes we next took only the PBMC and whole blood datasets (16 in total, the latter minus neutrophils and basophils) and created a “merged dataset” consisting of over 2.5 million cells. This includese standard metadata, UMAP, SCT expression values, and predicted protein scores.
The full, merged counts matrix is too large to be stored as a sparse matrix in R due to technical limitations. Thus, there are several steps required to create the hdf5 file required for the website. More details are available in the README.md file in the merge
directory. The file merge.sh
performs all the steps.
Links for data sources as provided by the original publication are available below. Processed datasets ready for analysis are available at https://atlas.fredhutch.org/fredhutch/covid/ as Seurat objects or h5seurat files.
Data were downloaded from the following locations:
We acknowledge the Scientific Computing Infrastructure at Fred Hutch funded by ORIP grant S10OD028685.