These analyses seek to repeat the processing employed in Yoann's first TNSeq paper with some minor improvements in index demultiplexing.
The original commits to this repository comprised the files:
- 01_annotation.Rmd and html
- 02_sample_estimation.Rmd and html
- 03_differential_expression.Rmd and html
The Rmd files are my worksheets and the html files are the reports generated from them. In order for them to run properly, the preprocessing/ directory contains the count tables and other inputs, while the reference/ directory contains the requisite genome and gff annotation.
Since then, I started prefixing the Rmd files with a date-based version. I therefore start with a given worksheet (20190115_01_annotation.Rmd for example) and lay down what I believe to be a suitable set of analyses. When I render it, I have knitr output an html file which is further prefixed with the run date; thus if something changes on my computer I can pick those changes up by comparing the various run dates for a single version. The outputs are therefore named something like: 20190116_20190115_01_annotation.html vs. 20190117_20190115_01_annotation.html. In this example, I ran the report on two days, Jan. 16th and 17th.
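As a concrete illustration of this convention, the rendering can be driven from R. A minimal sketch, assuming only that rmarkdown is installed; the worksheet name is illustrative:

```r
## Render a worksheet to an html report whose name is prefixed with the
## run date, e.g. 20190116_20190115_01_annotation.html.
library(rmarkdown)
worksheet <- "20190115_01_annotation.Rmd"
report <- paste0(format(Sys.Date(), "%Y%m%d"), "_",
                 tools::file_path_sans_ext(basename(worksheet)), ".html")
render(worksheet, output_format = "html_document", output_file = report)
```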
This repository therefore now contains my previous runs along with a fresh set run on 20190115. If we find that something in this is incorrect or unwanted, I will make changes directly to the existing files and rerun them. If those changes are big enough to warrant a new version, I will do that.
If one wishes to simply browse the tasks I performed, open the .html files in a web browser. Those documents are the knitr outputs of my worksheets.
If one wishes to run the tasks I performed, open the .Rmd files in an appropriate R IDE. One will still wish to look at the html files in order to get the set of prerequisite packages (at the bottom of each html).
The scripts used for preprocessing were generated by CYOA (http://github.com/elsayed-lab/CYOA) and invoked on the UMIACS computing cluster.
A copy of the various scripts generated in this process is in the preprocessing/hpgl0864/scripts/ directory.
The demultiplexing process for Yoann's TNSeq samples has passed through a few iterations. For the first samples, it was written as an exact match of each index. Later experiments showed some strange behavior with respect to the sequence leading up to the index; the process was therefore rewritten to take that into account and to allow for some ambiguity in sequence composition.
Here are my notes from when we implemented these changes:
- We now have ambiguous index searches which improve on the originals I performed. This takes the set of unknown sequences and uses a Levenshtein distance to better differentiate indexes. This looks like it will decrease the number of unsorted reads significantly.
- Later TNSeq runs showed some peculiar PCR products for which the adapter trimming process was changed to find and therefore keep more reads. This was not done in our first runs.
- We never explicitly checked that the mariner TA is actually where it is supposed to be.
The process now is therefore:
- cd into preprocessing/libs12 and preprocessing/libs34 and run cyoa's tnseq->sort_indexes task; it handles the ambiguous index searching. The resulting sequences are the same, only ~9 nucleotides shorter.
- copy/move the resulting sorted fastq.gz files into the appropriate directories in preprocessing.
- maybe concatenate the replicate time points into time0->time3
- Perform the TNSeq adapter trimming. The new version has flags to handle the following adapter oligos: ACAGGTTGGATGATAAGTCCCCGGTCTGACACATC, ACAGTCCCCGGTCTGACACATCTCCCTAT, and ACAGTCCNCGGTCTGACACATCTCCCTAT. This process will create files like 'l1t0-trimmed.fastq.gz'.
- Run a simple TA check on the remaining read pieces to make sure that they have TAs in the appropriate places, moving all reads without them into something like 'l1t0-trimmed_nota.fastq.gz' (see the sketch after this list).
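The TA check itself is a simple terminal-dinucleotide filter. Here is a minimal sketch in R, assuming the Bioconductor ShortRead package; the actual check is performed by the CYOA preprocessing scripts, and the filenames are illustrative:

```r
## Split trimmed reads by whether they end in the mariner TA dinucleotide.
library(ShortRead)
reads <- readFastq("l1t0-trimmed.fastq.gz")
has_ta <- grepl("TA$", as.character(sread(reads)))
writeFastq(reads[has_ta], "l1t0-trimmed_ta.fastq.gz", compress = TRUE)
writeFastq(reads[!has_ta], "l1t0-trimmed_nota.fastq.gz", compress = TRUE)
```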
The fourth point is perhaps the most interesting: it was included because later iterations of sequencing (starting in 2016) produced many reads with an explicit 'N' after the ACAGTCC of the read. In addition, many reads were observed to have a strange lengthening of the sequence preceding the index.
With that in mind, the process followed was:
- Extract the indexes used for each run by reading the file 'indexes.txt', which defines the indexes and the output filenames.
- Read the raw sequences and use a Levenshtein distance of 1 to sort each sequence into the appropriate output file (a sketch of the matching rule follows the summary below). The following is a representative summary generated by this process:
```
The total number of reads read in lane2_6.fastq.gz: 354553797.
The total number of reads sorted in lane2_6.fastq.gz: 327044369.
Index ATAGAGGC had 32959684 reads written from lane2_6.fastq.gz to outputs/hpgl1090.fastq.gz.
Index CAGGACGT had 36729707 reads written from lane2_6.fastq.gz to outputs/hpgl1095.fastq.gz.
Index ambiguous had 118215 reads written from lane2_6.fastq.gz to ambiguous.fastq.
Index GTACTGAC had 30772599 reads written from lane2_6.fastq.gz to outputs/hpgl1096.fastq.gz.
Index GGCTCTGA had 32136482 reads written from lane2_6.fastq.gz to outputs/hpgl1092.fastq.gz.
Index TCCGGAGA had 34391003 reads written from lane2_6.fastq.gz to outputs/hpgl1098.fastq.gz.
Index TATAGCCT had 32622342 reads written from lane2_6.fastq.gz to outputs/hpgl1089.fastq.gz.
Index AGGCGAAG had 32781978 reads written from lane2_6.fastq.gz to outputs/hpgl1093.fastq.gz.
Index unknown had 27391213 reads written from lane2_6.fastq.gz to uknown.fastq.
Index TAATCTTA had 31565524 reads written from lane2_6.fastq.gz to outputs/hpgl1094.fastq.gz.
Index CCTATCCT had 32600205 reads written from lane2_6.fastq.gz to outputs/hpgl1091.fastq.gz.
Index ATTACTCG had 30484845 reads written from lane2_6.fastq.gz to outputs/hpgl1097.fastq.gz.
```
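To make the matching rule concrete, here is a minimal sketch of the distance-1 assignment in R (the production sorter lives in CYOA); the two-column indexes.txt layout is an assumption:

```r
## Assign an observed index to its sample when exactly one known index is
## within Levenshtein distance 1; adist() computes that distance in base R.
index_map <- read.table("indexes.txt",
                        col.names = c("index", "output"),
                        stringsAsFactors = FALSE)
assign_index <- function(observed, indexes) {
  hits <- which(adist(observed, indexes) <= 1)
  if (length(hits) == 1) return(indexes[hits])  ## unambiguous match
  if (length(hits) > 1) return("ambiguous")     ## several indexes too close
  "unknown"                                     ## nothing within distance 1
}
assign_index("ATAGAGGT", index_map$index)  ## one mismatch away from ATAGAGGC
```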
In the best cases, this distance-1 sorting increased the sensitivity of finding reads to ~94% from our previous best case of ~75%.
The sequences resulting from the above were next scanned by cutadapt, and only those sequences with TNSeq adapter precursors were kept. The precursor sequences were removed, resulting in sequences which all ideally end in the mariner insertion point, TA. This was then confirmed via a simple regular expression, and only sequences with the terminal TA were kept, in files named 'hpglxxxx-trimmed_ca_ta.fastq.xz', where xxxx is the HPGL identifier, 'ca' marks cutadapt-filtered sequences, and 'ta' marks those checked for terminal TAs.
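The exact cutadapt invocation is recorded in the CYOA-generated scripts; a representative call, driven from R for consistency with the worksheets, might look like the following (the specific options and filenames here are an assumption):

```r
## Remove the TNSeq precursor from the 5' end and discard reads in which
## no precursor was found; the adapter shown is one of the three oligos
## listed above (cutadapt treats the N as matching any base).
cmd <- paste("cutadapt -g ACAGTCCNCGGTCTGACACATCTCCCTAT",
             "--discard-untrimmed",
             "-o hpgl1090-trimmed_ca.fastq.gz hpgl1090.fastq.gz")
system(cmd)
```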
bowtie1 was used to align the remaining reads against the S. pyogenes 5448 genome, with options to report one randomly chosen location for multi-mapped reads. The alignments were sorted and converted to bam via samtools and counted per gene via htseq.
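A hedged sketch of that pipeline, again driven from R; the authoritative versions are the CYOA-generated scripts in preprocessing/hpgl0864/scripts/, and the index, file, and annotation names here are illustrative:

```r
## Align, sort, and count one sample.  bowtie's '-M 1 --best' reports a
## single randomly chosen alignment when a read maps to multiple places.
cmds <- c(
  "bowtie -S --best -M 1 spyogenes_5448 hpgl1090-trimmed_ca_ta.fastq > hpgl1090.sam",
  "samtools sort -o hpgl1090.bam hpgl1090.sam",
  "samtools index hpgl1090.bam",
  "htseq-count -f bam -s no -t gene -i ID hpgl1090.bam spyogenes_5448.gff > hpgl1090.count"
)
for (cmd in cmds) system(cmd)
```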
The cyoa function Essentiality_TAs() read the fasta genome, the alignment bam file, and the gff annotations in order to create a catalog of all possible TAs and count how many times each TA was observed. This is one of the primary inputs for the DeJesus methods.
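A minimal sketch of the idea behind Essentiality_TAs(), assuming the Bioconductor Biostrings and GenomicAlignments packages; this is not the CYOA implementation, strand handling is elided, and the filenames are illustrative:

```r
library(Biostrings)
library(GenomicAlignments)
## Catalog every TA (possible mariner insertion site) in the genome.
genome <- readDNAStringSet("reference/spyogenes_5448.fasta")
ta_sites <- start(matchPattern("TA", genome[[1]]))
## Tally alignment 3' ends; a read ending in TA terminates on the second
## base of its insertion site (plus strand; strand handling omitted here).
aln <- readGAlignments("hpgl1090.bam")
end_tally <- table(end(aln))
ta_counts <- data.frame(
  position = ta_sites,
  reads = as.integer(end_tally[as.character(ta_sites + 1)]))
ta_counts$reads[is.na(ta_counts$reads)] <- 0
```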
The DeJesus mh-ess and tn-hmm methods were employed. These have since been included in their TRANSIT pipeline (http://saclab.tamu.edu/essentiality/transit/), but I was slow to adopt it.
The primary caveats in these invocations are:
- The mh-ess tool was invoked 6 separate times, once each with the -m option set to 1, 2, 4, 8, 16, and 32, in the hope of finding an appropriate setting for libraries at different levels of saturation (as sketched below).
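A sketch of that sweep; note the mh-ess command line shown here is hypothetical, since only the -m option is documented above, and the input/output names are placeholders:

```r
## Run the essentiality caller once per -m value to cover libraries at
## different saturation levels; invocation syntax is hypothetical.
for (m in c(1, 2, 4, 8, 16, 32)) {
  system(sprintf("mh-ess -m %d hpgl1090_tas.txt > hpgl1090_mh-ess_m%d.csv", m, m))
}
```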
The resulting csv files from this process were used to define the set of 'essential' genes in each sample.
The resulting count tables and csv files from the essentiality tools were passed to hpgltools (http://github.com/elsayed-lab/hpgltools) in order to visualize the samples and metrics and eventually perform some further analyses.
In later iterations of the methods employed in the hpgltools suite, the resulting reports are generated with a filename containing the date the analysis was performed along with the analysis revision. Since this is actually going to get published, I will rerun these analyses using this newer method. hpgltools has also evolved to include some new and (hopefully) improved analyses; ideally these will not change the final results, but let's find out!
The file 01_annotation.Rmd comprises the methods employed to collect the appropriate S. pyogenes annotations: querying the microbesonline database (http://microbesonline.org) and gathering gene IDs from the gff file used during preprocessing.
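For the gff portion, a minimal sketch assuming the rtracklayer package; 01_annotation.Rmd holds the actual code, including the microbesonline query, and the gff filename is illustrative:

```r
## Pull gene IDs out of the gff used during preprocessing.
library(rtracklayer)
annotations <- import("reference/spyogenes_5448.gff")
genes <- annotations[annotations$type == "gene"]
head(genes$ID)
```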
In order to view the relationships of the samples, hpgltools was used to visualize the matrices created by the htseq-generated count tables. In this process, the count tables, annotations, and sample sheet were combined into an expressionset and queried with various normalizations and plots as outlined in 02_sample_estimation.Rmd.
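In outline, and assuming hpgltools' create_expt()/normalize_expt() interface (02_sample_estimation.Rmd has the real invocations; the sample sheet and annotation names are placeholders), that looks something like:

```r
library(hpgltools)
## Combine the sample sheet, count tables, and annotations into one object.
expt <- create_expt("sample_sheet.xlsx", gene_info = gene_annotations)
## One representative view: log2 cpm, quantile normalized, low-count filtered.
norm <- normalize_expt(expt, transform = "log2", convert = "cpm",
                       norm = "quant", filter = TRUE)
plot_pca(norm)
```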
The htseq count tables from the alignment were used to calculate a simplified metric of 'fitness'. This process used hpgltools to invoke limma/DESeq2/edgeR, combine their outputs, and look for genes observed to be significantly increased/decreased. The methods employed are outlined in 03_differential_expression.Rmd.
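Again in outline, assuming hpgltools' all_pairwise()/combine_de_tables() interface and reusing the expt object from the sketch above; 03_differential_expression.Rmd contains the actual analysis:

```r
## Run limma, DESeq2, and edgeR over all contrasts and merge the results.
de <- all_pairwise(expt)
combined <- combine_de_tables(de, excel = "fitness_tables.xlsx")
```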