Merge pull request #92 from jolespin/devel

updated logo and docs
jolespin · May 11, 2024 · 8a72333
2 parents b947686 + 46f60a4
commit 8a72333
Showing 16 changed files with 629 additions and 110 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -426,7 +426,7 @@ There was a problem importing veba_output/misc/reads_table.tsv:
 
 **Definitely:**
 
-* Cluster `MicroEuk50` -> `MicroEuk30`
+* Add `BiNI` biosynthetic novelty index to `biosynthetic.py`
 * `busco_wrapper.py` that relabels all the genes, runs analysis, then converts output to tsv.
 * Script to update genome clusters
 * Script to update protein clusters

diff --git a/README.md b/README.md
@@ -15,19 +15,13 @@
 [issues-shield]: https://img.shields.io/github/issues/jolespin/veba.svg?style=for-the-badge
 [issues-url]: https://github.com/jolespin/veba/issues
 
-```
- _    _ _______ ______  _______
-  \  /  |______ |_____] |_____|
-   \/   |______ |_____] |     |
-```
+[![Schematic](images/graphical-abstract.png)](images/Schematic.pdf)
 
 ### What is VEBA? 
 The *Viral Eukaryotic Bacterial Archaeal* (VEBA) is an open-source software suite developed with all domains of microorganisms as the primary objective (not post hoc adjustments) including prokaryotic, eukaryotic, and viral organisms.  VEBA is an end-to-end metagenomics and bioprospecting software suite that can directly recover and analyze eukaryotic and viral genomes in addition to prokaryotic genomes with native support for candidate phyla radiation (CPR). VEBA implements a novel iterative binning procedure and an optional hybrid sample-specific/multi-sample framework that recovers more genomes than non-iterative methods.  To optimize the microeukaryotic gene calling and taxonomic classifications, VEBA includes a consensus microeukaryotic database containing protists and fungi compiled from several existing databases. VEBA also provides a unique clustering-based dereplication strategy allowing for sample-specific genomes and proteins to be directly compared across non-overlapping biological samples. VEBA also automates biosynthetic gene cluster identification and novelty scores for bioprospecting.
 
 VEBA's mission is to make robust (meta-)genomics/transcriptomics analysis effortless.  The philosophy of VEBA is that workflows should be modular, generalizable, and easy-to-use with minimal intermediate steps.  The approach implemented in VEBA is to (try and) think 2 steps ahead of what you may need to do and automate the task for you.
 
-[![Schematic](images/Schematic.png)](images/Schematic.pdf)
-
 <p align="right"><a href="#readme-top">^__^</a></p>
 
 ___________________________________________________________________
@@ -37,7 +31,7 @@ ___________________________________________________________________
 * Espinoza JL, Phillips A, Prentice MB, Tan GS, Kamath PL, Lloyd KG, Dupont CL. Unveiling the Microbial Realm with VEBA 2.0: A modular bioinformatics suite for end-to-end genome-resolved prokaryotic, (micro)eukaryotic, and viral multi-omics from either short- or long-read sequencing.  [BioRxiv Preprint: doi.org/10.1101/2024.03.08.583560v2](https://www.biorxiv.org/content/10.1101/2024.03.08.583560v2). In review somewhere else.
 * Espinoza JL, Dupont CL. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics. 2022 Oct 12;23(1):419. [doi: 10.1186/s12859-022-04973-8](https://doi.org/10.1186/s12859-022-04973-8). PMID: 36224545.
 
-Please cite the software dependencies described under the [*Dependency Citation Table*](CITATIONS.md).
+In addition to the above, please cite the software dependencies described under the [*Dependency Citation Table*](CITATIONS.md).
 
 <p align="right"><a href="#readme-top">^__^</a></p>
 

diff --git a/bin/README.md b/bin/README.md
diff --git a/images/graphical-abstract.pdf b/images/graphical-abstract.pdf
diff --git a/images/graphical-abstract.png b/images/graphical-abstract.png
diff --git a/images/graphical-abstract/Modules/assembly.md b/images/graphical-abstract/Modules/assembly.md
@@ -0,0 +1,44 @@
+
+
+```mermaid
+%%{init: { "flowchart": { "curve": "linear" } } }%%
+
+%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
+
+graph LR
+
+subgraph "`**assembly**`"
+
+	%% Programs
+	METASPADES["metaSPAdes"]
+	SAMTOOLS["samtools"]
+	BOWTIE2_INDEX["bowtie2-build"]
+	BOWTIE2["bowtie2"]
+	FEATURECOUNTS["featureCounts"]
+	SEQKIT["seqkit stats"]
+
+	%% inputs
+	READS[\"cleaned_1/2.fastq.gz"/]
+
+	%% outputs
+	STATS["statistics.tsv"]
+		
+	%% metaSPAdes
+	READS --repair.sh--> METASPADES
+	METASPADES --> ASSEMBLY["scaffolds.fasta"]
+	ASSEMBLY --"fasta_to_saf.py"--> SAF["scaffolds.fasta.saf"]
+
+	%% Bowtie2
+	ASSEMBLY --> BOWTIE2_INDEX --> INDEX["scaffolds.fasta.*.bt2"]
+
+	READS & INDEX -->  BOWTIE2 --> SAMTOOLS --> BAM["mapped.sorted.bam"]
+
+	%% featureCounts
+	BAM & SAF --> FEATURECOUNTS --> COUNTS["counts.tsv"]
+
+	ASSEMBLY --> SEQKIT --> STATS
+
+end
+
+
+```
diff --git a/images/graphical-abstract/Modules/binning-prokaryotic.md b/images/graphical-abstract/Modules/binning-prokaryotic.md
@@ -0,0 +1,65 @@
+
+
+```mermaid
+%%{init: { "flowchart": { "curve": "linear" } } }%%
+
+%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
+
+graph TD
+
+
+	%% Programs
+	COVERM["coverm"]
+	PYRODIGAL["Pyrodigal"]
+	METABAT2["Metabat2"]
+	MAXBIN2_107["MaxBin2(MarkerSet=107)"]
+	MAXBIN2_40["MaxBin2(MarkerSet=40)"]
+	CONCOCT["CONCOCT"]
+	DASTOOL["DAS_Tool"]
+	TIARA["Tiara"]
+	CHECKM2["CheckM2"]
+	BARRNAP["barrnap"]
+	TRNASCANSE["tRNAscan-SE"]
+	FEATURECOUNTS["featureCounts"]
+	SEQKIT["seqkit stats"]
+
+	%% inputs
+	ASSEMBLY["scaffolds.fasta"]
+	BAM["mapped.sorted.bam"]
+
+	%% outputs
+	STATS["statistics.tsv"]
+		
+	BAM --> COVERM --> COVERAGE["coverage.tsv"]
+	ASSEMBLY --> PYRODIGAL --> PROTEINS["proteins.fasta"] & CDS["cds.fasta"] & GFF["gene_models.gff"]
+
+subgraph "`**_N_ iterative binning-prokaryotic**`"
+
+	ASSEMBLY & COVERAGE --> METABAT2 --> MAGS_METABAT["MAGs<SUB>Metabat2</SUB>"]
+	ASSEMBLY & COVERAGE --> MAXBIN2_107 --> MAGS_MAXBIN2_107["MAGs<SUB>MaxBin2_107</SUB>"]
+	ASSEMBLY & COVERAGE --> MAXBIN2_40 --> MAGS_MAXBIN2_40["MAGs<SUB>MaxBin2_40</SUB>"]
+	ASSEMBLY & COVERAGE --> CONCOCT --> MAGS_CONCOCT["MAGs<SUB>CONCOCT</SUB>"]
+
+	MAGS_METABAT & MAGS_MAXBIN2_107 & MAGS_MAXBIN2_40 & MAGS_CONCOCT & PROTEINS --> DASTOOL
+	
+	DASTOOL --> CANDIDATE_MAGS["MAGs<SUB>Candidate</SUB>"]
+
+	CANDIDATE_MAGS --> TIARA
+	TIARA --> MAGS_P["MAGs<SUB>Prokaryotic</SUB>"]
+	TIARA --x MAGS_E["MAGs<SUB>Eukaryotic</SUB>"]
+
+	MAGS_P & PROTEINS --> CHECKM2
+
+	CHECKM2 --> MAGS_PASSED["MAGs<SUB>Passed</SUB>"]
+	CHECKM2 --x MAGS_FAILED["MAGs<SUB>Failed</SUB>"] --> UNBINNED["unbinned.fasta"] --> BEGINNING["Repeat with unbinned.fasta"]
+
+
+end
+
+MAGS_PASSED --> BARRNAP --> RRNA["MAGS.rRNA.fasta"]
+MAGS_PASSED --> TRNASCANSE --> TRNA["MAGS.TRNA.fasta"]
+
+MAGS_PASSED & CDS & RRNA & TRNA --> SEQKIT --> STATS
+
+
+```
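
Note on `binning-prokaryotic.md`: the loop drawn in the diagram above (bin, keep the MAGs that pass quality control, return everything else to `unbinned.fasta`, repeat) can be sketched roughly as follows. This is a toy illustration and not VEBA's implementation: `iterative_binning`, `bin_once`, and `passes_qc` are hypothetical names standing in for the binner/DAS_Tool step and the Tiara/CheckM2 filters, and stopping early when an iteration yields no passing MAG is an assumption.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

MAG = FrozenSet[str]  # a bin, represented as a set of contig identifiers


def iterative_binning(
    contigs: Iterable[str],
    bin_once: Callable[[Set[str]], List[MAG]],  # hypothetical: binners + DAS_Tool
    passes_qc: Callable[[MAG], bool],           # hypothetical: Tiara + CheckM2
    n_iter: int,
) -> Tuple[List[MAG], Set[str]]:
    """Run up to n_iter rounds of binning on whatever is still unbinned."""
    unbinned: Set[str] = set(contigs)
    passed: List[MAG] = []
    for _ in range(n_iter):
        if not unbinned:
            break
        candidates = bin_once(unbinned)
        kept = [mag for mag in candidates if passes_qc(mag)]
        if not kept:
            # Assumption: stop once a round recovers nothing new.
            break
        passed.extend(kept)
        for mag in kept:
            # Contigs of failed MAGs stay in the unbinned pool for the next round.
            unbinned -= mag
    return passed, unbinned


def toy_binner(pool: Set[str]) -> List[MAG]:
    """Toy binner: group the sorted contigs into pairs."""
    ordered = sorted(pool)
    return [frozenset(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]


def toy_qc(mag: MAG) -> bool:
    """Toy quality filter: reject any bin that contains contig 'c6'."""
    return "c6" not in mag


passed_mags, leftover = iterative_binning(
    ["c1", "c2", "c3", "c4", "c5", "c6"], toy_binner, toy_qc, n_iter=3
)
```

In the toy run, the first round keeps two bins and the second round keeps none, so the loop ends with `c5` and `c6` left unbinned.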
diff --git a/images/graphical-abstract/Modules/mermaid_test.md b/images/graphical-abstract/Modules/mermaid_test.md
@@ -0,0 +1,114 @@
+
+
+```mermaid
+%%{init: { "flowchart": { "curve": "linear" } } }%%
+
+%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
+
+graph TD
+subgraph "`**preprocessing**`"
+	%% modules
+	PREPROCESS_SHORT(["`_preprocess-short_`"])
+	PREPROCESS_LONG(["`_preprocess-long_`"])
+	
+	%% inputs
+	R1[\"Illumina_1.fastq.gz"/]
+	R2[\"Illumina_2.fastq.gz"/]
+	LONG[\"ONT|PacBio.fastq.gz"/]
+	
+	
+	%% databases
+	CONTAMINATION[(Contamination)]
+	KMER[(K-mer Profiles)] 
+	
+	%% ---
+	
+	
+	%% preprocess/-long
+	R1 & R2 --> PREPROCESS_SHORT
+	CONTAMINATION -.-> PREPROCESS_SHORT
+	KMER -.-> PREPROCESS_SHORT
+	
+	LONG --> PREPROCESS_LONG
+	CONTAMINATION -.-> PREPROCESS_LONG
+	KMER -.-> PREPROCESS_LONG
+end
+
+subgraph "`**assembly**`"
+	%%inputs 
+	ASSEMBLY(["`_assembly|assembly-long_`"])
+	
+	%% outputs
+	ASSEMBLY_FASTA[["assembly.fasta"]]
+	BAM[["mapped.sorted.bam"]]
+	
+	%% assembly/-long
+	PREPROCESS_SHORT --cleaned_1/2.fastq.gz--> ASSEMBLY
+	PREPROCESS_LONG --cleaned.fastq.gz--> ASSEMBLY
+	ASSEMBLY --> ASSEMBLY_FASTA & BAM
+end
+
+%% -- 
+
+subgraph "`**binning**`"
+	%% modules
+	BINNING_VIRAL(["`_binning-viral_`"])
+	BINNING_PROKARYOTIC(["`_binning-prokaryotic_`"])
+	BINNING_EUKARYOTIC(["`_binning-eukaryotic_`"])
+	
+	
+	%% outputs
+	GENOMES_AND_GENE_MODELS("Genomes & Gene Models")
+	GENOMES[["Genomes"]]
+	GENE_MODELS[["Gene Models"]]
+	
+	%% databases
+	%%CHECKV[("CheckV")]--> BINNING_VIRAL
+	%%GENOMAD[("geNomad")]--> BINNING_VIRAL
+	
+	%% --
+	%% binning-viral
+	ASSEMBLY_FASTA & BAM --> BINNING_VIRAL
+	
+	%% binning-prokaryotic 
+	BINNING_VIRAL --unbinned.fasta--> BINNING_PROKARYOTIC
+	BAM --> BINNING_PROKARYOTIC
+	
+	%% binning-eukaryotic
+	BINNING_PROKARYOTIC --unbinned.fasta--> BINNING_EUKARYOTIC
+	BAM --> BINNING_EUKARYOTIC
+	
+	%% coverage 
+	%% COVERAGE("coverage|coverage-long") 
+	
+	BINNING_VIRAL & BINNING_PROKARYOTIC & BINNING_EUKARYOTIC --"genome-resolved"--> GENOMES_AND_GENE_MODELS
+	GENOMES_AND_GENE_MODELS --> GENOMES & GENE_MODELS
+
+
+end
+
+%% --
+
+subgraph "`**clustering**`"
+	%% modules
+	CLUSTER("`_cluster_`")
+	
+	%% output
+	PROTEIN_CLUSTERS[["SLC-specific Protein Clusters (SSPC)"]]
+	GENOME_CLUSTERS[["Species-level Clusters (SLC)"]]
+	
+	
+	%% cluster
+	GENOMES & GENE_MODELS--> CLUSTER
+	CLUSTER --> GENOME_CLUSTERS
+	CLUSTER --> PROTEIN_CLUSTERS
+
+end
+
+subgraph "`**annotation**`"
+ANNOTATE("`_annotate_`")
+
+GENE_MODELS & PROTEIN_CLUSTERS  --> ANNOTATE
+end
+
+```
diff --git a/images/graphical-abstract/Modules/preprocess.md b/images/graphical-abstract/Modules/preprocess.md
@@ -0,0 +1,53 @@
+
+
+```mermaid
+%%{init: { "flowchart": { "curve": "linear" } } }%%
+
+%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
+
+graph LR
+
+subgraph "`**preprocess**`"
+
+	%% Programs
+	FASTP["FastP"]
+	BOWTIE2["Bowtie2"]
+	SEQKIT["seqkit stats"]
+	BBDUK["BBDuk"]
+
+	%% Databases
+	CONTAMINATION[("Contamination")]
+	KMERS[("K-mer Profiles")]
+
+	%% inputs
+	READS[\"Illumina_1/2.fastq.gz"/]
+
+	%% outputs
+	STATS["statistics.tsv"]
+		
+	%% FastP
+	READS --> FASTP
+
+	FASTP --"trimmed_1/2.fastq.gz"--> BOWTIE2
+
+	%% Bowtie2
+	CONTAMINATION --> BOWTIE2
+	BOWTIE2 --"cleaned_1/2.fastq.gz"--> BBDUK
+	BOWTIE2 --"contaminated_1/2.fastq.gz"--> STATS
+
+	%%BBDuk	
+	KMERS --> BBDUK
+
+	READS --> SEQKIT
+	BOWTIE2 --> SEQKIT
+	BBDUK --"cleaned_1/2.non-kmer_hits.fastq.gz"--> SEQKIT
+	BBDUK --"cleaned_1/2.kmer_hits.fastq.gz"--> SEQKIT
+
+	SEQKIT --> STATS
+
+
+	
+end
+
+
+```
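
Note on `preprocess.md`: the edges in the diagram above imply an execution order, which can be checked with a small sketch. The edge list is transcribed from the Mermaid source, while `EDGES` and `execution_order` are hypothetical names introduced here. The sketch only orders the nodes and does not run any of the tools.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple

# Edges transcribed from the preprocess diagram (source node, target node).
EDGES: List[Tuple[str, str]] = [
    ("READS", "FASTP"),
    ("FASTP", "BOWTIE2"),
    ("CONTAMINATION", "BOWTIE2"),
    ("BOWTIE2", "BBDUK"),
    ("BOWTIE2", "STATS"),
    ("KMERS", "BBDUK"),
    ("READS", "SEQKIT"),
    ("BOWTIE2", "SEQKIT"),
    ("BBDUK", "SEQKIT"),
    ("SEQKIT", "STATS"),
]


def execution_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Return one valid topological order of the nodes in the edge list."""
    predecessors: Dict[str, Set[str]] = {}
    for src, dst in edges:
        predecessors.setdefault(dst, set()).add(src)
        predecessors.setdefault(src, set())  # make sure source-only nodes appear
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(EDGES)
```

Any order returned this way places `FASTP` before `BOWTIE2`, `BOWTIE2` before `BBDUK`, and `SEQKIT` before `STATS`. A cycle in the edge list would raise `graphlib.CycleError`.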
diff --git a/images/graphical-abstract/graphical-abstract-youtube-logo.png b/images/graphical-abstract/graphical-abstract-youtube-logo.png
diff --git a/images/graphical-abstract/graphical-abstract-youtube.png b/images/graphical-abstract/graphical-abstract-youtube.png
diff --git a/images/graphical-abstract/graphical-abstract-youtube.pptx b/images/graphical-abstract/graphical-abstract-youtube.pptx
diff --git a/images/graphical-abstract/graphical-abstract.pdf b/images/graphical-abstract/graphical-abstract.pdf
diff --git a/images/graphical-abstract/graphical-abstract.png b/images/graphical-abstract/graphical-abstract.png
diff --git a/images/graphical-abstract/graphical-abstract.pptx b/images/graphical-abstract/graphical-abstract.pptx
diff --git a/install/DATABASE.md b/install/DATABASE.md
@@ -228,7 +228,7 @@ VEBA’s Microeukaryotic Protein Database has been completely redesigned using t
 **Deprecated:**
 
 <details>
-	<summary> VEBA Database* version: VDB_v5.2 (243 GB) </summary>
+	<summary> VEBA Database version: VDB_v5.2 (243 GB) </summary>
 
 *  Added `MicrobeAnnotator-KEGG` [Zenodo: 10020074](https://zenodo.org/records/10020074) which includes KEGG module pathway information from [`MicrobeAnnotator`](https://doi.org/10.1186/s12859-020-03940-5).
 *  Added `CAZy` protein sequences from [`dbCAN2`](https://academic.oup.com/nar/article/46/W1/W95/4996582)
@@ -823,7 +823,7 @@ tree -L 3 .
 
 
 <details>
-	<summary>VEBA Database* version: VDB_v3.1</summary>
+	<summary>VEBA Database version: VDB_v3.1</summary>
 
 The same as `VDB_v3` but updates `VDB-Microeukaryotic_v2` to `VDB-Microeukaryotic_v2.1` which has a `reference.eukaryota_odb10.list` containing only the subset of identifiers that are core eukaryotic markers (useful for classification).
 
@@ -933,7 +933,7 @@ tree -L 3 .
 
 
 <details>
-	<summary>VEBA Database* version: VDB_v3</summary>
+	<summary>VEBA Database version: VDB_v3</summary>
 
 ```
 tree -L 3 .
@@ -1031,7 +1031,7 @@ tree -L 3 .
 
 
 <details>
-	<summary>VEBA Database* version: VDB_v2</summary>
+	<summary>VEBA Database version: VDB_v2</summary>
 
 * Compatible with *VEBA* version: `v1.0.2a+`