scartrace data processing
- Go to the folder containing the raw reads. Then run:
scmo_workflow.py scartrace
This creates the Snakefile containing all the required steps and a config file.
- Edit the config file (config.json): make sure to set the right path to a reference Fasta file, and optionally pick a different mapper or bin sizes. An example config edit is sketched below this list.
- Execute the workflow locally by running
snakemake
or run on a cluster using one of the commands shown by scmo_workflow.py. Important: when running from a conda environment, do not forget to use --use-conda! Example submission using submission.py and a conda environment:
submission.py "source activate conda;snakemake --use-conda --cluster slurm_wrapper.py --jobs 20 --restart-times 3" -y -time 50 -m 2 -sched slurm -N scar_snakepipe
The workflow code can be found here
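For the config-editing step above, here is a minimal sketch using jq; the key name 'reference_file' is an assumption, so check the generated config.json for the actual field names:
# Hypothetical sketch: point config.json at your reference Fasta.
# 'reference_file' is an assumed key name; verify it in your generated config.
jq '.reference_file = "/path/to/reference.fa"' config.json > config.tmp && mv config.tmp config.json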
This pipeline contains the following steps:
- Demultiplexing.
- Adapter trimming using cutadapt.
- Mapping using BWA or bowtie2.
- Tagging the bam file with relevant information - e.g. allele information, read start position, and scar information.
- Filtering the bam file based on mean base quality (stored in the SQ tag).
- Generating library quality statistics plots.
- Generating count tables of filtered and unfiltered data.
Demultiplexing adds UMI, cell, sample and sequencing index information to the header of the FastQ file. This information is later used to determine which reads belong to the same molecule and to which cell every molecule belongs. The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied. The demultiplexer also trims the random primer from read 2 and stores it as meta-information; this information is later used for IVT deduplication.
Assume you are located in a directory with fastq files:
example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz
To then autodetect the right barcodes and demultiplex run:
demux.py *.gz --y -merge _
It can be useful to additionally supply -hd 1, which will assign fragments with a single differing base to the closest cell barcode. This recovers 5 to 20% more demultiplexed reads. The effect can be assessed by running with -hd 1 but without --y; this performs a dry run on a small subset of reads.
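For example, to preview the effect of Hamming-distance-1 barcode assignment without writing output (a dry run, since --y is omitted):
demux.py *.gz -hd 1 -merge _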
Running the demultiplexer with --y will create a folder ./raw_demultiplexed. Inside this folder there will be four files:
demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads
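A quick way to verify that the UMI, cell and sample information ended up in the read headers is to inspect the first record (a sketch, assuming zcat is available):
zcat raw_demultiplexed/demultiplexedR1.fastq.gz | head -4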
Adapter trimming with cutadapt removes residual primers, which improves mapping performance:
cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG" -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA" -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"
The trimmed fastq files can now be mapped using a mapper of your choice. In this example we will use BWA. We have tested mapping with bwa and bowtie2, and although both are possible, bowtie2 does not perform as well as bwa on this data. Bwa is also better adapted to mapping reads with indels. Importantly, when we map with bwa, we add the -I flag, defining the expected insert size of our amplicon. Our amplicons are 221/222 bp, so we set -I to 220. This prevents many read pairs from mapping far apart, even though we map to repetitive amplicons. Without this setting, we see many read pairs where R1 and R2 map to different amplicons.
bwa mem -t 4 -I 220 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -Sb - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam;
mv ./sorted.unfinished.bam ./sorted.bam;
samtools index ./sorted.bam; rm ./unsorted.bam;
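As a sanity check that the -I 220 setting keeps read pairs on the same amplicon, you can inspect the insert-size summary reported by samtools stats (a sketch; how strict to be is up to you):
# The summary-number (SN) section includes insert size average and standard deviation.
samtools stats sorted.bam | grep '^SN' | grep 'insert size'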
Example using bowtie2:
bowtie2 -p 4 -q --no-unal --local --sensitive-local -N 1 -x {your.reference} -1 {input.r1} -2 {input.r2} | samtools view -Sb - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam;
mv ./sorted.unfinished.bam ./sorted.bam;
samtools index ./sorted.bam; rm ./unsorted.bam;
The mapped bam file is tagged using bamtagmultiome.py -method scartrace. This places all information that was stored in the read name for alignment back into the bam file as tags. The option '-method scartrace' adds a tag describing the misalignment (scar) to the read, or states 'WT' when the read maps perfectly. It also computes a mean read quality from the average of the phred scores of each base. The option '-alleles' requires a .vcf input file and adds allelic information in a tag; the alleles to use must also be specified with the -allele_samples option. The following tags are added by this script:
- 'DA' - stores allele information
- 'SD' - stores scar information
- 'DS' - stores read start position (site)
- 'SQ' - stores mean base quality
Example:
bamtagmultiome.py -method scartrace -allele_samples C57BL_6NJ,129S1_SvImJ -alleles {path_to_vcf} {input.bam} -o tagged.bam --every_fragment_as_molecule -scartrace_r1_primers CCTTGAACTTCTGGTTGTAG
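To verify that the tags were written, you can print them for the first alignment using standard samtools and grep (a sketch; adjust as needed):
# Show only the DA/SD/DS/SQ tags of the first read in the tagged bam.
samtools view tagged.bam | head -n 1 | tr '\t' '\n' | grep -E '^(DA|SD|DS|SQ):'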
Two quality filters are included in the pipeline: one is based on the read quality, the other on the maximum allowed insert size. First, the bam file is filtered based on the mean base quality ('SQ') tag - only reads with a mean base quality above the threshold you set will be stored and used for generating the count table. Everything below this threshold is considered to be noise. By default, the threshold is set at >0.98.
Example:
bamFilter.py {input.bam} -o {output.bam} 'r.has_tag("SQ") and r.get_tag("SQ") > 0.98'
Second, reads are filtered on the maximum insert size. Since we amplify and map to repetitive regions, it is possible that R1 is mapped to amplicon 1 and R2 to amplicon 2 (or 4, or 7). We already try to prevent this as much as possible by setting specific mapping parameters. This works in most cases, but not all. To filter out any remaining noise created by read pairs not mapping to the same amplicon, we throw out any read pairs with an insert size >1000 (the default). You can set the maximum insert size in the config file.
Example:
bamFilter.py {input.bam} -o {output.bam} 'r.has_tag("fS") and (r.get_tag("fS") < 1000)'
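Since bamFilter.py evaluates a Python expression per read, the two filters can presumably be combined into a single pass; a sketch, assuming the expression syntax shown above:
bamFilter.py {input.bam} -o {output.bam} 'r.has_tag("SQ") and r.get_tag("SQ") > 0.98 and r.has_tag("fS") and r.get_tag("fS") < 1000'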
Read the Library-statistics-plots page for information about library quality statistics.
The mapped and tagged bam file is converted to a count table using bamToCountTable.py, which takes '-joinedFeatureTags' as rows and '-sampleTags' as columns. For columns, we use the 'SM' tag (sample name). For rows, we use the following tags: 'chrom' (chromosome), 'DA' (allele), 'DS' (site), 'SD' (scar). This is done for both the SQ-filtered and unfiltered bam files.
Example:
bamToCountTable.py input.bam -o count_table.csv -joinedFeatureTags chrom,DA,DS,SD -sampleTags SM
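To take a first look at the resulting table with standard shell tools (a sketch, assuming the comma-separated output produced above):
# Show the header plus the first few feature rows, aligned into columns.
head -n 5 count_table.csv | column -s, -t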