From 756eb06ad3edfd0366952ae9508fdd6745f46af3 Mon Sep 17 00:00:00 2001 From: Will Nickols <78048944+WillNickols@users.noreply.github.com> Date: Thu, 14 Dec 2023 12:52:35 -0500 Subject: [PATCH] Update README.md --- README.md | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 102 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index ec2328f..a9993e5 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ # Installation -The workflow can be installed with the following commands. It is okay if the Checkm2 database install throws an error starting with `File "checkm2/bin/checkm2", line 244, in ...`. The PhyloPhlAn run will intentionally fail in order to download the database but will take a while to download the database. +The workflow can be installed with the following commands. It is okay if the Checkm2 database install throws an error starting with `File "checkm2", line 244, in ...`. The PhyloPhlAn run will intentionally fail in order to download the database but will take a while to download the database. ``` git clone https://github.com/WillNickols/assembly_workflow cd assembly_workflow @@ -23,15 +23,17 @@ export CHECKM_DATA_PATH=$(pwd)/databases/checkm/ export PHYLOPHLAN_PATH=$(pwd)/databases/phylophlan/ ``` -# Example runs +# Tutorial +## Example download The example files can be downloaded and unzipped using the following commands: ``` wget http://huttenhower.sph.harvard.edu/biobakery_demo/biobakery_workflows/assembly/example.tar.gz tar -xf example.tar.gz ``` -This runs the workflow on a set of paired-end files that have already been cleaned with Kneaddata. The `by_sample` flag specifies that abundance estimates of the sample's taxa should come from aligning that sample's reads only to its contigs. Alternatively, the `by_dataset` flag can be used to align a sample's reads against all contigs generated from the input folder when the samples likely share the same taxa. +## Example 1 +The following command runs the workflow on a set of paired-end fastq files that have already been cleaned with Kneaddata. The `by_sample` flag specifies that abundance estimates of the sample's taxa should come from aligning that sample's reads only to its contigs. Alternatively, the `by_dataset` flag can be used to align a sample's reads against all contigs generated from the input folder. Using the `by_dataset` flag can be useful when the samples likely share the same taxa but not every sample will assemble the shared genomes. ``` python assembly_workflow.py \ -i example/input/sample_1/ \ @@ -45,9 +47,44 @@ python assembly_workflow.py \ --remove-intermediate-output ``` -The output, `example/output/sample_1/final_profile_by_sample.tsv` is a MetaPhlAn-like output with the abundance of each taxonomic group identified in the sample. The output shows that Pseudoalteromonas marina is present in the sample along with a species, `sgb_01` that represents a new species genome bin not within a Mash distance of 0.05 of any known SGB, 0.15 of any known GGB, or 0.3 of any known FGB. +### Abundance output +The output, `example/output/sample_1/final_profile_by_sample.tsv` is a MetaPhlAn-like output with the abundance of each taxonomic group identified in the sample. We can inspect it with: -This runs the workflow on a concatenated file with pre-created contigs in `example/output/sample_2/assembly/main/sample_2/sample_2.contigs.fa`. Such a file could be created from the `biobakery_workflows wmgx` workflow with `--run-assembly`. Note that the original reads are still required for the abundance calculation. Additionally, grid options and scratch space information can be provided as seen below. +``` +head example/output/sample_1/final_profile_by_sample.tsv +``` + +### Phylogenetic output +The output shows that Pseudoalteromonas marina is present in the sample along with the assembled genome `sgb_01` representing a new species genome bin (SGB). We can check the PhyloPhlAn placement of this new SGB by examining `example/output/sample_1/main/phylophlan/phlophlan_relab.tsv`: +``` +head example/output/sample_1/main/phylophlan/phlophlan_relab.tsv +``` +
+Given that this SGB was marked as unclassified, how far away should the closest SGB, GGB, and FGB be? +
+The closest SGB, GGB, and FGB should have a Mash distance of more than 0.05, 0.15, and 0.3 respectively. +
+ +### Quality output +We can also check the Checkm2 report to determine the quality of these bins: +``` +head example/output/sample_1/main/checkm/qa/checkm_qa_and_n50.tsv +``` +This file will contain the quality metrics on all assembled bins regardless of their quality. However, the bins used for the final abundance table are only the medium- and high-quality bins. + +### Genome bins +Using the output file tree below, see if you can find the assembled genome bins from `sample_1`. +
+ + +
+ +The bins are in `example/output/sample_1/main/bins/sample_1/bins/`. + +
+ +## Example 2 +We might want to create genome bins after running a standard biobakery workflow. In this case, we can run the SGB workflow on pre-created contigs such as from the `biobakery_workflows wmgx` workflow with the `--run-assembly` flag. Here, we'll start from the contigs in `example/output/sample_2/assembly/main/sample_2/sample_2.contigs.fa`. Note that the original read files are still required since we need to perform alignment for the abundance calculation. ``` python assembly_workflow.py \ -i example/input/sample_2/ \ @@ -61,6 +98,66 @@ python assembly_workflow.py \ --remove-intermediate-output ``` +## Example 3 +In the `tutorial` folder of this GitHub, the `tutorial/animal_guts_profile.tsv` file is an example output from a set of 62 diverse animal stool samples. + +### New SGBs +
+ How many new SGBs were created? +
+ +We can check the number of times 'sgb' appears in the first column: +``` +awk -F'\t' '$1 ~ /sgb/ {count++} END {print count}' tutorial/animal_guts_profile.tsv +``` +We see there were 93 new SGBs. + +
+ +We can also see that some new SGBs show up in multiple samples: +``` +awk -F'\t' '$1 == "\"sgb_89\"" {print}' tutorial/animal_guts_profile.tsv +``` +Samples 2 and 4 had MAGs that were close enough that they were merged into the same novel SGB. In fact, both of these samples came from the same fin whale. + +### Proportion unknown +Finally, we can visualize how much of each sample's abundance is made of known microbes, new SGBs, and unknown microbes. The following script will produce a `figures` folder in the `tutorial` folder, from which you can examine the unknown abundance. +``` +Rscript abundance_script.R +``` +We can see that the vast majority of most samples consists of unknown genetic material. Patially, this is due to the fact that wild animal guts are not very well characterized, but it is also due to the fact that assembly methods tend to have low recall. + +# Output file tree + +The folder specified by `-o` will have the following important files: +``` +- anadama.log (log of commands and outputs) +- final_profile_by_[sample/dataset].tsv (MetaPhlAn-like abundance table) +- main/ + - abundance_by_[sample/dataset]/ + - [sample_name].abundance.tsv (abundance estimates of MAGs in this sample) + - [sample_name].coverage.tsv (per-congig coverage in this sample) + - [sample_name].mapped_read_num.txt (number of reads mapping to contigs in this sample) + - [sample_name].total_read_num.txt (total reads in this sample) + - assembly/ + - main/ + - [sample_name]/ + - [sample_name].final.contigs.fa (fasta file of contigs for this sample) + - bins/ + - [sample_name]/ + - bins/ + - [sample_name].bin.[bin number].fa (one MAG from this sample) + - checkm/ + - qa/ + - checkm_qa_and_n50.tsv (Checkm2 quality information for each MAG) + - phylophlan/ + - phylophlan_relab.tsv (PhyloPhlAn taxonomic information for each MAG) + - sgbs/ (for MAGs not assigned by PhyloPhlAn) + - sgbs/ + - SGB_info.tsv (Information on which bins are in which SGBs and which bin represents the SGB) + - sgb_[SGB number].fa (SGB representative genome) +``` + # Extra checks This command runs the workflow without taxonomically placing the MAGs (it runs only assembly, binning, and quality checking).