From 9063c547de32d12df492cde26044c8f1584aa0df Mon Sep 17 00:00:00 2001
From: Ryan Wick
Date: Sat, 30 Jul 2016 15:09:26 +1000
Subject: [PATCH] Markdown formatting fixes

---
 README.md | 53 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index 5d27295..1b5ce1d 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ Updates in v0.2.0

 1. Some improvements to allele calling, particularly for Klebsiella MLST locus mdh, kindly contributed by andreyto. Includes rejection of read alignments that are clipped on both ends (likely to be spurious) and minor bug fixes associated with depth calculations.
 2. Updated E. coli serotype database to remove duplicate sequences.
-3. Added mcr-2 colistin resistance gene to ARGannot.r1.fasta resistance gene database.
+3. Added mcr-2 colistin resistance gene to `ARGannot.r1.fasta` resistance gene database.
 4. A `--threads` option was added, which makes SRST2 call Bowtie and Samtools with their threading options. The resulting speed up is mostly due to the Bowtie mapping step which parallelises very well.
 5. The `VFDB_cdhit_to_csv.py` script was updated to work with the new VFDB FASTA format.
 6. Versions of Bowtie2 up to 2.2.9 are now supported. Samtools v1.3 can now be used as well, however v0.1.18 is still the recommended version (for reasons discussed below).
@@ -94,8 +94,8 @@ Updates in v0.1.8

 1. /data directory includes files for subtyping of the LEE pathogenicity island of E. coli, as per [Ingle et al, 2016, Nature Microbiology](http://www.nature.com/articles/nmicrobiol201510). [Instructions below](https://github.com/katholt/srst2#typing-the-lee-pathogenicity-island-of-e-coli)
 2. Resistance gene database updates:
- * Fixed ARGannot.r1.fasta to include proper mcr1 DNA sequence.
- * Added columns to the ARGannot_clustered80.csv table, to indicate classes of beta-lactamases included in the ARGannot.r1.fasta database according to the [NCBI beta-lactamase resource](http://www.ncbi.nlm.nih.gov/pathogens/beta-lactamase-data-resources/) (new location for the Lahey list).
+ * Fixed `ARGannot.r1.fasta` to include proper mcr1 DNA sequence.
+ * Added columns to the `ARGannot_clustered80.csv` table, to indicate classes of beta-lactamases included in the `ARGannot.r1.fasta` database according to the [NCBI beta-lactamase resource](http://www.ncbi.nlm.nih.gov/pathogens/beta-lactamase-data-resources/) (new location for the Lahey list).
 3. Fixed some issues with handling of missing data (i.e. where there were no hits to MLST and/or no hits to genes) when compiling results into a table via `--prev_output`. This could result in misalignment of gene columns in previous versions.

 -----------
@@ -106,7 +106,7 @@ Updates in v0.1.7
  * SRST2_SAMTOOLS
  * SRST2_BOWTIE2
  * SRST2_BOWTIE2_BUILD
-2. Added mcr1, the plasmid-borne colisting resistance gene to the included ARG-Annot-based resistance gene DB (ARGannot.r1.fasta)
+2. Added mcr1, the plasmid-borne colisting resistance gene to the included ARG-Annot-based resistance gene DB (`ARGannot.r1.fasta`)
 3. Fixed a problem with writing consensus files that occurred when a directory structure was specified using `--output` (bug introduced in v0.1.6)

 -----------
@@ -116,7 +116,7 @@ Updates in v0.1.6
 1. The original validation of SRST2 (see [paper](http://genomemedicine.com/content/6/11/90)) was performed with bowtie2 version 2.1.0 and samtools v0.1.18.
  * bowtie2: SRST2 has now been tested on the tutorial example and other test data sets using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4.
  * samtools: SRST2 has now been tested on the Staph & Salmonella test data sets used in the paper, and will work with newer samtools versions (tested up to v1.1). Note however that SRST2 still works best with [samtools v0.1.18](https://sourceforge.net/projects/samtools/files/samtools/0.1.18/), due to small changes in the mapping algorithms in later versions that result in some loss of reads at the ends of alleles. This has most impact at low read depths, however we do recommend using v0.1.18 for optimum results.
-2. Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in data/ARGannot.r1.fasta).
+2. Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in `data/ARGannot.r1.fasta`).
 3. Added EcOH database for serotyping E. coli (thanks to Danielle Ingle for this). See [Using the EcOH database for serotyping E. coli with SRST2](https://github.com/katholt/srst2#using-the-ecoh-database-for-serotyping-e-coli-with-srst2) and [this BioRxiv paper](http://biorxiv.org/content/early/2015/11/18/032151).
 4. Fixed a problem where, when analysing multiple read sets in one SRST2 call against a gene database in which cluster ids don't match gene symbols, individual gene clusters appear multiple times in the output. The compile function was unaffected and remains unchanged.
 5. Fixed behaviour so that including directory paths in `--output` parameter works (thanks to nyunyun for contributing most of this fix). E.g.
`--output test_dir/test` will create output files prefixed with `test`, located in `test_dir/`, and all SRST2 functions should work correctly including consensus allele calling. If `test_dir/` doesn't exist, we attempt to create it; if this is not possible the user is alerted and SRST2 stops.
@@ -155,7 +155,7 @@ The highest such value for each gene/cluster/locus is reported in the fullgenes
 The highest such value across all MLST loci is reported in the mlst output table.
 Note that all compiled reports will now include a maxMAF column; if you provide MLST or compiled reports from previous versions without this columns, the value “NC” will be inserted in the maxMAF column to indicate “not calculated”. This ensures the updated SRST2 (v0.1.4+) is backwards compatible with previous SRST2 outputs; do be aware though that the older versions of SRST2 ( rawseqs_cdhit90.stdout
+```
+cdhit-est -i rawseqs.fasta -o rawseqs_cdhit90 -d 0 > rawseqs_cdhit90.stdout
+```

 2 - Parse the cluster output and tabulate the results, check for inconsistencies between gene names and the sequence clusters, and generate individual fasta files for each cluster to facilitate further checking:
@@ -787,9 +794,11 @@ For comparing gene names to cluster assignments, this script assumes very basic

 3 - Convert the resulting csv table to a sequence database using:

-    csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -f rawseqs.fasta -c 4
+```
+csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -f rawseqs.fasta -c 4
+```

-The output file, seqs_clustered.fasta, should now be ready to use with srst2 (`--gene_db seqs_clustered.fasta`).
+The output file, `seqs_clustered.fasta`, should now be ready to use with srst2 (`--gene_db seqs_clustered.fasta`).

 If there are potential inconsistencies detected at step 2 above (e.g.
multiple clusters for the same gene, or different gene names within the same cluster), you may like to investigate further and change some of the cluster assignments or cluster names. You may find it useful to generate neighbour joining trees for each cluster that contains >2 genes, using align_plot_tree_min3.py

@@ -798,7 +807,9 @@ A preliminary set of resistance genes is in the /data directory of srst2, this i

 An easy way to add sequences to this database would be to add new rows to the table, and then generate an updated fasta file using:

-    csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -s rawseqs.fasta -c 5
+```
+csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -s rawseqs.fasta -c 5
+```

 ### Using the VFDB Virulence Factor Database with SRST2

@@ -806,7 +817,7 @@ The VFDB houses sets of virulence genes for a range of bacterial genera, see htt

 To type these virulence genes using SRST2, download the full set of sequences from the VFDB website (http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz) and follow these steps to generate SRST2-compatible files for your genus of interest.

-1 - Extract virulence genes by genus from the main VFDB file, CP_VFs.ffn:
+1 - Extract virulence genes by genus from the main VFDB file, `CP_VFs.ffn`:

 ```
 python VFDBgenus.py --infile CP_VFs.ffn --genus Clostridium
@@ -836,13 +847,13 @@ python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Cl
 python csv_to_gene_db.py -t Clostridium_cdhit90.csv -o Clostridium_VF_clustered.fasta -s 5
 ```

-The output file, Clostridium_VF_clustered.fasta, should now be ready to use with srst2 (`--gene_db Clostridium_VF_clustered.fasta`).
+The output file, `Clostridium_VF_clustered.fasta`, should now be ready to use with srst2 (`--gene_db Clostridium_VF_clustered.fasta`).

 ### Using the EcOH database for serotyping E. coli with SRST2

 Details can be found in [this BioRxiv paper](http://biorxiv.org/content/early/2015/11/18/032151).

-The EcOH database includes genes for identifying O and H types in E. coli, see /data/EcOH.fasta
+The EcOH database includes genes for identifying O and H types in E. coli, see `/data/EcOH.fasta`

 O types are represented by the presence of two loci (either wzy and wzy, or wzm and wzt). Note that allelic variation is possible but does not impact serotype in a predictable way, so typing calls should be made based on the presence of genes rather than allele assignments (i.e. it is generally safe to ignore *? characters)