-
Notifications
You must be signed in to change notification settings - Fork 10
scCHIC data processing
- Go to the folder containing the raw reads. Then run:
scmo_workflow.py chic
This creates the Snakefile containing all the required steps and a config file.
-
Edit the config file (config.json), make sure to set the right path to a reference Fasta file and optionally pick a different mapper or bin sizes.
-
Execute the workflow locally by running
snakemake
or run on a cluster by using of the commands shown byscmo_workflow.py
.
The workflow code can be found here
Here the steps required for executing the pipeline by hand are described
To obtain a count table the following steps have to be performed:
- Demultiplexing
- Adapter trimming
- Mapping
- Molecule assignment
- Count table generation
Demultiplexing adds UMI, cell, sample and sequencing index information to the header of the FastQ file. This information is later used to figure out which reads belong to the same molecule and to what cell every molecule belongs. The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied. The demultiplexer also trims the random primer from read 2 and stores it as meta-information, this information is later used for IVT deduplication.
Assume you are located in a directory with fastq files:
example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz
To then autodetect the right barcodes and demultiplex run:
demux.py *.gz --y -merge _
It can be useful to additionally supply -hd 1
, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1
but not supplying the --y
, this will perform a dry run on a small subset of reads.
This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:
demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads
removes residual primers, this improves mapping performance:
cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG" -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA" -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"
The trimmed fastq files can now be mapped using a mapper of your choice. In this example we will use BWA.
bwa mem -t 4 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -b - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam;
mv ./sorted.unfinished.bam ./sorted.bam;
samtools index ./sorted.bam; rm ./unsorted.bam;
The aim of this step is to, for every fragment in sorted.bam find scCHIC seq fragments and assign them to a molecule. This can later be used for deduplication. Fragments with the same cell barcode, umi, library and strand and starting within a range of 5 bp from each other are assumed to belong to the same molecule. The Micrococcal nuclease (MNase) cut site location is expected to be between the first base (Usually an A) this A is part of the sequencing adapter, and the second base (Usually a T). The cut site location is recorded into the DS tag. When alleles are specified using -alleles, the molecule assignment is split up by allele, this means that if two fragments map to the same location and share the same umi, but contain SNPs which indicate differing alleles, the reads are not assigned to the same molecule. For every fragment the ligation sequence is recorded into the RZ tag.
bamtagmultiome.py sorted.bam -method chic -o ./tagged.bam
The MNase cuts around locations where a selected histone mark is present, preferably between A/T bases. The digested fragments are ligated to the blunt end of the adapter. The cut site is defined as the genomic mapping location of the second base in read 1. In case of mapping to the antisense strand this corresponds to the before last base of the read 1 sequence stored in the BAM file. The ligation motif is defined as the two bases flanking the MNase cut site. Read 2 is not required to assign a MNase cut site to the fragment as long as R1 contains a sequence which maps it uniquely to the reference genome used. To make sure we always obtain all fragments from a single location within reasonable memory bounds PySamIterators is used to traverse over all read pairs.
The starting location of read 2 of every fragment is the result of random priming, this means that for a single molecule there can be multiple mapping locations for R2 which are indicative of multiple RT copies from the same RNA template.
The locations of the MNase cut sites can be binned into bins of a given size. In this example 250kb The minMQ parameter defines what the lowest mapping quality for a molecule should be to be counted. The --filterXA makes filters reads which map to multiple locations (excludes alternative loci) The -sampleTags parameter defines what row identifiers the resulting table should have. If you would like to know the associated barcode you would use "BC" instead of "SM". The --dedup flag makes sure every molecule is counted only once, if this flag is not specified every fragment in the BAM file with a DS tag (Which is a valid scCHIC site) will be counted. The -joinedFeatureTags option allows each row of the count table to start with the reference_name (i.e., chromosome name).
The -bin parameter defines the size of every bin
bamToCountTable.py -minMQ 30 --filterXA ./tagged/sorted.bam -o ./count_table.csv
-joinedFeatureTags reference_name -sampleTags SM -bin 250_000 -binTag DS --dedup --r1only
A sliding window based table can be generated by adding the -sliding flag, for example for windows offset by 50k use:
bamToCountTable.py -sliding 50_000 --filterXA -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -joinedFeatureTags reference_name -binTag DS --dedup --r1only
The genomic regions can also be defined by a 4-column tab-delimited bed file by specifying --bedFile
. Each row in the file has chromosome, start, end, and name of row (e.g., gene name).
bamToCountTable.py --filterXA -minMQ 50 inbam.bam -o count_table.csv -sampleTags SM -joinedFeatureTags reference_name --dedup --r1only -bedfile genome_regions.bed