-
Notifications
You must be signed in to change notification settings - Fork 10
TAPS data processing
Add paper and protocol information
As described on the main page, install singlecellmultiomics package.
git clone https://github.com/BuysDB/SingleCellMultiOmics
pip install ./SingleCellMultiOmics
Demultiplexing adds UMI, cell, sample, and sequencing index information to the header of the FastQ file.
The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied.
Assume you are located in a directory with fastq files:
example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz
To then autodetect the right barcodes and demultiplex run:
demux.py *.gz --y -merge _
It can be useful to additionally supply -hd 1
, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1
but not supplying the --y
, this will perform a dry run on a small subset of reads.
This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:
demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads
removes residual primers, this improves mapping performance:
cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG" -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA" -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"
Use your mapper of preference to map the reads. Make sure to use a reference which includes spike-ins.
bwa mem -t 4 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -b - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam;
mv ./sorted.unfinished.bam ./sorted.bam;
samtools index ./sorted.bam; rm ./unsorted.bam;```
Some of the methylated bases will not be converted by TAPs. The efficiency of the TAPs conversion can be estimated using a fully methylated spike-in. The resulting conversion efficiency can then be estimated:
estimateTapsConversionEfficiency.py ./sorted.bam -o conv_efficiency -ref [reference_fasta_path] -method nla
conv_efficiency
is prefix of the output files.
conv_efficiency_conversions_lambda.csv
First column is the library, second is cell index, third is the amount of mismatches to the reference excluding C's, the other columns contains the estimated conversion efficiency for each CpG motif.
tapsTabulator.py sorted.bam -min_phred_score 20 -ref [reference_fasta_path] | gzip > calls.tsv.gz
Bigwig files allow for visualisation of the methylation levels in a genome browser and allow for quick access of various statistics.
bam_to_methylation_bw.py sorted.bam -o BULK_bw -method nla -t 20 -bin_size 400
create a CSV file with two columns, the first is the barcode index (BI tag) and the second the pseudobulk you want to assign to, for example:
3,cluster_A
4,cluster_A
5,cluster_A
6,cluster_A
7,cluster_B
8,cluster_B
9,cluster_B
10,cluster_B
11,cluster_B
(Stored in my.csv
)
bam_to_methylation_bw.py sorted.bam -o PSEUDO_bw -pseudobulk_BI_csv my.csv -method nla -t 20
This generates beta big wigs for each pseudobulk (cluster_A, and cluster_B).
bam_to_methylation_bw.py sorted.bam --single_sample -o SINGLE_bw -method nla -t 20