-
Notifications
You must be signed in to change notification settings - Fork 15
/
06-Preprocessing.Rmd
406 lines (293 loc) · 15 KB
/
06-Preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
# Preprocessing of dataset { #preprocess }
## Preprocessing for RNA Velocity
There are three options we have tried (more exist, e.g., kallisto bustools):
- [velocyto.py](http://velocyto.org): the earliest software for this purpose.
Generally not computationally efficient, possible due to written in Python.
For unknown reason, the proportion of unspliced RNA is unrealisticly high for
5' 10x Genomics data.
- [dropEst](https://dropest.readthedocs.io): as implemented in C/C++, it is much
more efficient. It also returns more reasonable proportion of unspliced RNAs
for 5' 10x Genomics data
- [STAR-solo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md):
new extension for the popular STAR. Benefits: efficient and one step for reads
alignment and counting of unspliced RNA (**Recommended option**)
**Please always read the manual of a specific software you are using.**
In this tutorial, we would take the Pancreas dataset used in scVelo paper for example, show how to retrieve the fastq dataset from NCBI, process using cellranger, velocyto or dropest to obtain the spliced and unspliced count matrices for RNA velocity estimation.
### List of packages aligned in pipeline:
1. [SRATool](https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit): for downlowding datasets from NCBI, Pancreas dataset E15_5 with SRA id SRR9201794
2. [Cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest): for counting, analyzing, QA on fastqs, outputs input for velocyto/dropest
3. Alternative tools for counting spliced and unspliced reads from Cellranger outputs/
- [Velocyto](http://velocyto.org/velocyto.py/tutorial/cli.html#introduction)
- [samtools](http://www.htslib.org/): Velocyto's dependency pakcage
- [dropest](https://dropest.readthedocs.io/en/latest/dropest.html)
### Installation of packages/softwares
All the codes are executed on Linux commandline, lines starting with "#" are comments.
My working path root is "/data/users/cqiao/share/",
you may change to any foler as you like.
#### SRATool
Refer to [This tutorial](https://akiomiyao.github.io/ped/sratoolkit/index.html) for step 3. software configuration
```
# 1. download software
wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
# 2. extract package
tar -xzf sratoolkit.tar.gz
# 3. configuration software
./sratoolkit.2.11.0-ubuntu64/vdb-config --interactive
# 4. add software to system path so that we can easily call it
export PATH=$PATH:/data/users/cqiao/share/sratoolkit.2.11.0-ubuntu64/bin/
# 5. test installation, expected to output some sequencing results
fastq-dump --stdout -X 2 SRR390728
```
#### Cellranger
```
# 1. download software
wget -O cellranger-6.0.2.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-6.0.2.tar.gz?Expires=1625079682&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci02LjAuMi50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2MjUwNzk2ODJ9fX1dfQ__&Signature=Ez05ie1~xJTrUhsc3Q1nZ3JemF5Jc2k3zOOC2PlJiI80TI8YOxM3vVXli90ib1Kl4WgQFRl5~Wp-ce5m2gVAyywGf9juiMk3kW2WGz7xy3DN9nCG-he0rNLvbknDXgkqkj8HeirWyt-EvGgsaqBRZZx7GppGl~MGQOseLTF7DMpJaIm-qwU-72mkVzDy0vKrYVZ~Uiqnlr0M2Hhx14lPwm6QocBuSaKLkmcuZosN4TIS05Z~GPHYt9hQWkykvY4oyd5Y~tyTaMOM8ozVVdI7KgCfc2PmQvhr1bMuhUaEPa3skigzAFjIezbw-fXW3Z8jplwj4lAQ0LybI03ozZr6jg__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"
# 2. extract files
tar -xzf cellranger-6.0.2.tar.gz
# 3. add software to system path so that we can easily call it
export PATH=$PATH:/data/users/cqiao/share/cellranger-6.0.2
# 4. download gene refereces (optional saving path: ./cellranger/gtfs/)
cd cellranger-6.0.2
mkdir gtfs
cd gtfs
# human (optional for your needs)
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
# mouse (we shall use this for Pancreas dataset)
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz
tar -xzf refdata-gex-mm10-2020-A.tar.gz
```
#### Velocyto & Samtools
##### [Samtools](http://www.htslib.org/download/)
```
# 1. download & extract software
wget https://github.com/samtools/samtools/releases/download/1.12/samtools-1.12.tar.bz2
tar -jxvf samtools-1.12.tar.bz2
cd samtools-1.12
# 2. configure build, specify --prefix to locate the root dir
./configure --prefix=/data/users/cqiao/share/samtools
# 3. compile and install software
make
make install
# 4. add software to system path so that we can easily call it
PATH=$PATH:/data/users/cqiao/share/samtools/bin/
# go back to parant folder
cd ..
```
##### [Velocyto](http://velocyto.org/velocyto.py/install/index.html)
Assume your python packages are managed by conda and pip
```
# install dependent packages, may also use pip
conda install numpy scipy cython numba matplotlib scikit-learn h5py click
# install velocyto
pip install velocyto
```
#### [Dropest](https://dropest.readthedocs.io/en/latest/setup.html)
Install dependent libraries
Install RPackages (in R)
```
install.packages(c("Rcpp","RcppEigen", "RInside", "Matrix"))
```
Setup a local lib path:
```
export LOCAL_LIBS=/data/users/cqiao/share/local
mkdir $LOCAL_LIBS
export PATH=$LOCAL_LIBS/bin:$LOCAL_LIBS/usr/local/bin/:$PATH
```
Install CMake
```
wget https://cmake.org/files/v3.12/cmake-3.12.0-rc1.tar.gz
tar xvf cmake-3.12.0-rc1.tar.gz
cd cmake-3.12.0-rc1
./bootstrap --prefix=$LOCAL_LIBS
make
make install
cd ..
```
Zlib
```
wget https://zlib.net/zlib-1.2.11.tar.gz
tar xvf zlib-1.2.11.tar.gz
cd zlib-1.2.11
./configure --prefix=$LOCAL_LIBS
make
make install
cd ..
```
BamTools
```
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build && cd build
cmake ../
make
make install DESTDIR=$LOCAL_LIBS
cd ..
```
Bzip2
```
wget https://www.sourceware.org/pub/bzip2/bzip2-latest.tar.gz
tar -xzvf bzip2-latest.tar.gz
cd bzip2-1.0.8
make -f Makefile-libbz2_so
make install PREFIX=$LOCAL_LIBS
cp -a libbz2.so* $LOCAL_LIBS/lib/
ln -s $LOCAL_LIBS/lib/libbz2.so.1.0 $LOCAL_LIBS/lib/libbz2.so
cd ..
```
Boost
```
wget http://sourceforge.net/projects/boost/files/boost/1.60.0/boost_1_60_0.tar.gz
tar xzf boost_1_60_0.tar.gz
cd boost_1_60_0
./bootstrap.sh
./b2 cxxflags="-std=c++11" include="$LOCAL_LIBS/include/" search="$LOCAL_LIBS/lib/" link=shared threading=multi install --prefix=$LOCAL_LIBS
cd ..
```
DropEst
```
git clone https://github.com/hms-dbmi/dropEst.git
mkdir dropEst/build
cd dropEst/build
cmake -D CMAKE_INSTALL_PREFIX="/data/users/cqiao/share/dropesttool" .. && make
export PATH=$PATH:/data/users/cqiao/share/dropesttool/bin/
```
### Preprocessing pipline
#### 1. Download and split Pancreas E15_5 fastq data from NCBI:
```
fasterq-dump --split-files SRR9201794
```
Result would be:
```
SRR9201794_1.fastq SRR9201794_2.fastq
```
Rename files for cellranger:
```
mv SRR9201794_1.fastq SRR9201794_S1_L001_R1_001.fastq
mv SRR9201794_2.fastq SRR9201794_S1_L001_R2_001.fastq
```
Optinally, move them to a folder `/data/users/cqiao/share/E15_5`
#### 2. Run Cellranger count, referring to the mouse refdata
```
cellranger count --id=pancreas_E15_5 \
--fastqs=/data/users/cqiao/share/E15_5 \
--transcriptome=/data/users/cqiao/share/cellranger-6.0.2/gtfs/refdata-gex-mm10-2020-A
```
`--id` specifies the output dir, the output root.
#### 3. Run Velocyto/Dropest.
prepare some variables to store path names, so that afterwards we do not need to use the long names.
```
# Barchodes if cellranger output files:
gzip -d ./pancreas_E15_5/outs/filtered_feature_bc_matrix/barcodes.tsv.gz
BARCODES=./pancreas_E15_5/outs/filtered_feature_bc_matrix/barcodes.tsv
# Bam file
BAMFILE=./pancreas_E15_5/outs/possorted_genome_bam.bam
# transcriptome reference
GTF=/data/users/cqiao/share/cellranger-6.0.2/gtfs/refdata-gex-mm10-2020-A/genes/genes.gtf
```
- __Velocyto__
```
velocyto run -b $BARCODES -o ./velocyto_res $BAMFILE $GTF
```
Results are stored in `./velocyto_res/XXX.loom`
- __Dropest__
Requires a .xml configuration file specifying the platform settings for sequencing data. Template configure file could be downloaded [here](https://github.com/kharchenkolab/dropEst/tree/master/configs). We shall use 10x.xml for Pancreas E15_5 dataset. Replace line 64 with the path of 10x.xml:
```
...
<barcodes_file>./pancreas_E15_5/outs/filtered_feature_bc_matrix/barcodes.tsv</barcodes_file>
...
```
Then execute the following code:
```
$bin/dropest -f -V -g $GTF -c $config_file $BAMFILE
```
Results are stored in file `./pancreas_E15_5/dropEst/cell.counts.matrices.rds`.
## Preprocessing for somatic mutation analysis { #preprocess_mutations }
### Pileup with cellsnp-lite
Cellsnp-lite is designed to perform efficient pileup and genotyping for both bulk and single cell sequencing data. It could be easily installed via [conda](#conda_install) with `conda install -c bioconda -c conda-forge cellsnp-lite`. Note that, cellsnp-lite does not support Windows, but you can use it with the help of the `Windows Subsystem for Linux (WSL)`. Details about WSL can be found in section below.
We use a SMART-seq2 dataset as an example, which were packed into `joxm.bam.all.77.tar.gz` (77 bam files, ~3.5G in total; [McCarthy et al, 2020](https://www.nature.com/articles/s41592-020-0766-3)).
First, please download `joxm.bam.all.77.tar.gz`, `joxm.hg19.bam.lst` and `joxm.sample.lst` from [sgcellworkshop](https://sourceforge.net/projects/sgcellworkshop/files/mtDNA_analysis/) repo on sourceforge. Put the three files in the same directory. Then unpack the bam files with command `tar zxvf joxm.bam.all.77.tar.gz`. The bam files, together with the `.bai` files, should be in the `sort/` directory.
#### Pileup without given SNPs
In this example, we use cellsnp-lite to pileup chrM on the bam files without a given SNP list (in denovo mode), whose output could be used as inputs of MQuad model for mitochondrial clone analysis.
Note that the pileup results `joxm.hg19.cellsnp.mode2b.tar.gz` can be downloaded from [sgcellworkshop](https://sourceforge.net/projects/sgcellworkshop/files/mtDNA_analysis/) repo on sourceforge and then unpacked with command `tar zxvf joxm.hg19.cellsnp.mode2b.tar.gz`.
The command line to generate the results is,
```{shell}
cellsnp-lite \
-S ./joxm.hg19.bam.lst \
-i ./joxm.sample.lst \
-O ./cellsnp2b \
--chrom MT \
--cellTAG None \
--UMItag None \
--minCOUNT 20 \
--minMAF 0 \
--genotype \
--gzip
```
#### Pileup with given SNPs
This time, we use cellsnp-lite to pileup given SNPs on the bam files. The 112 SNPs, stored in `joxm.hg19.somatic.sites.vcf`, were somatic SNPs called in ([McCarthy et al, 2020](https://www.nature.com/articles/s41592-020-0766-3)) and can be downloaded from [sgcellworkshop](https://sourceforge.net/projects/sgcellworkshop/files/mtDNA_analysis/) repo.
Note that the pileup results `joxm.hg19.somatic.cellsnp.mode1b.tar.gz` can be downloaded from [sgcellworkshop](https://sourceforge.net/projects/sgcellworkshop/files/mtDNA_analysis/) repo on sourceforge and then unpacked with command `tar zxvf joxm.hg19.somatic.cellsnp.mode1b.tar.gz`.
The command line to generate the results is,
```{shell}
cellsnp-lite \
-S ./joxm.hg19.bam.lst \
-i ./joxm.sample.lst \
-O ./cellsnp1b \
-R ./joxm.hg19.somatic.sites.vcf \
--cellTAG None \
--UMItag None \
--minCOUNT 0 \
--minMAF 0 \
-p 4 \
--genotype \
--gzip
```
### Clonal analysis with MQuad
The command line below is used to generate the MQuad results:
```{shell}
mquad -c ./cellsnp2b -o ./mquad_dat -p 4
```
## (Optional) Install Windows Subsystem for Linux {#wsl_install}
**Note, this is for Windows users only.** While some required softwares only support Linux or macOS, you could install WSL to use Linux inside Windows.
The whole process of installing WSL requires at least 2G disk space. Note that this process was tested on Windows 10 (Version 2004, build 19041.1052).
### What is the Windows Subsystem for Linux (WSL)?
According to the [Microsoft Docs](https://docs.microsoft.com/en-us/windows/wsl/about),
>*"The Windows Subsystem for Linux lets developers run a GNU/Linux environment -- including most command-line tools, utilities, and applications -- directly on Windows, unmodified, without the overhead of a traditional virtual machine or dualboot setup."*
### Manual Installation Steps
**Step 1. Enable required feature in Windows PowerShell**
It is necessary to enable the required feature for WSL before installing it.
- Type `powershell` in the search box of the Windows taskbar.
- Right click `Windows PowerShell` and select `Run as administrator`.
```{r wsl_fig1, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", out.width="45%"}
knitr::include_graphics("./images/wsl_open_powershell.png")
```
- Type the command below in PowerShell.
```{shell}
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
```
```{r wsl_fig2, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", out.width="60%"}
knitr::include_graphics("./images/wsl_enable_feature.png")
```
**Step 2. Download and install WSL**
Microsoft now supports several Linux distributions as WSL, such as Ubuntu, openSUSE, Fedora, etc ([a full list here](https://docs.microsoft.com/en-us/windows/wsl/install-win10#step-6---install-your-linux-distribution-of-choice)), among which we choose Ubuntu as an example.
Ubuntu WSL could be freely downloaded and installed through [Microsoft Store](https://www.microsoft.com/en-us/p/ubuntu-2004-lts/9n6svws3rx71?rtc=1&activetab=pivot:overviewtab).
- Go to the webpage for [Ubuntu](https://www.microsoft.com/en-us/p/ubuntu-2004-lts/9n6svws3rx71?rtc=1&activetab=pivot:overviewtab) in Microsoft Store.
- Click on the `Get` button.
- Wait for completion of downloading and installation.
- Click on the `Launch` button.
```{r wsl_fig3, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", out.width="60%"}
knitr::include_graphics("./images/wsl_download_install.png")
```
**Step 3. Create a new account for Ubuntu**
After successfully installing Ubuntu, a new user account should be created.
- Type user name and password following the prompts on the screen. Note, it is normal that the password is invisible when you are typing.
```{r wsl_fig4, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", out.width="40%"}
knitr::include_graphics("./images/wsl_create_account.png")
```
**Now Congratulations! You have successfully installed and set up Ubuntu in your Windows System!**
Next time you can re-open Ubuntu through the search box of the Windows taskbar.
```{r wsl_fig5, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", out.width="35%"}
knitr::include_graphics("./images/wsl_open_ubuntu.png")
```
More information about the usage of WSL can be found at [Microsoft Docs](https://docs.microsoft.com/en-us/windows/wsl).