Merge pull request #92 from jolespin/devel
updated logo and docs
jolespin authored May 11, 2024
2 parents b947686 + 46f60a4 commit 8a72333
Showing 16 changed files with 629 additions and 110 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -426,7 +426,7 @@ There was a problem importing veba_output/misc/reads_table.tsv:

**Definitely:**

* Cluster `MicroEuk50` -> `MicroEuk30`
* Add `BiNI` biosynthetic novelty index to `biosynthetic.py`
* `busco_wrapper.py` that relabels all the genes, runs analysis, then converts output to tsv.
* Script to update genome clusters
* Script to update protein clusters
10 changes: 2 additions & 8 deletions README.md
@@ -15,19 +15,13 @@
[issues-shield]: https://img.shields.io/github/issues/jolespin/veba.svg?style=for-the-badge
[issues-url]: https://github.com/jolespin/veba/issues

-```
-_ _ _______ ______ _______
-\ / |______ |_____] |_____|
-\/ |______ |_____] | |
-```
+[![Schematic](images/graphical-abstract.png)](images/Schematic.pdf)

### What is VEBA?
The *Viral Eukaryotic Bacterial Archaeal* (VEBA) is an open-source software suite developed with all domains of microorganisms as the primary objective (not post hoc adjustments) including prokaryotic, eukaryotic, and viral organisms. VEBA is an end-to-end metagenomics and bioprospecting software suite that can directly recover and analyze eukaryotic and viral genomes in addition to prokaryotic genomes with native support for candidate phyla radiation (CPR). VEBA implements a novel iterative binning procedure and an optional hybrid sample-specific/multi-sample framework that recovers more genomes than non-iterative methods. To optimize the microeukaryotic gene calling and taxonomic classifications, VEBA includes a consensus microeukaryotic database containing protists and fungi compiled from several existing databases. VEBA also provides a unique clustering-based dereplication strategy allowing for sample-specific genomes and proteins to be directly compared across non-overlapping biological samples. VEBA also automates biosynthetic gene cluster identification and novelty scores for bioprospecting.

VEBA's mission is to make robust (meta-)genomics/transcriptomics analysis effortless. The philosophy of VEBA is that workflows should be modular, generalizable, and easy-to-use with minimal intermediate steps. The approach implemented in VEBA is to (try and) think 2 steps ahead of what you may need to do and automate the task for you.

-[![Schematic](images/Schematic.png)](images/Schematic.pdf)

<p align="right"><a href="#readme-top">^__^</a></p>

___________________________________________________________________
@@ -37,7 +31,7 @@ ___________________________________________________________________
* Espinoza JL, Phillips A, Prentic MB, Tan GS, Kamath PL, Lloyd KG, Dupont CL. Unveiling the Microbial Realm with VEBA 2.0: A modular bioinformatics suite for end-to-end genome-resolved prokaryotic, (micro)eukaryotic, and viral multi-omics from either short- or long-read sequencing. [BioRxiv Preprint: doi.org/10.1101/2024.03.08.583560v2](https://www.biorxiv.org/content/10.1101/2024.03.08.583560v2). In review somewhere else.
* Espinoza JL, Dupont CL. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics. 2022 Oct 12;23(1):419. [doi: 10.1186/s12859-022-04973-8](https://doi.org/10.1186/s12859-022-04973-8). PMID: 36224545.

-Please cite the software dependencies described under the [*Dependency Citation Table*](CITATIONS.md).
+In addition to the above, please cite the software dependencies described under the [*Dependency Citation Table*](CITATIONS.md).

<p align="right"><a href="#readme-top">^__^</a></p>

443 changes: 346 additions & 97 deletions bin/README.md

Large diffs are not rendered by default.

Binary file added images/graphical-abstract.pdf
Binary file not shown.
Binary file added images/graphical-abstract.png
44 changes: 44 additions & 0 deletions images/graphical-abstract/Modules/assembly.md
@@ -0,0 +1,44 @@


```mermaid
%%{init: { "flowchart": { "curve": "linear" } } }%%
%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
graph LR
subgraph "`**assembly**`"
%% Programs
METASPADES["metaSPAdes"]
SAMTOOLS["samtools"]
BOWTIE2_INDEX["bowtie2-build"]
BOWTIE2["bowtie2"]
FEATURECOUNTS["featureCounts"]
SEQKIT["seqkit stats"]
%% inputs
READS[\"cleaned_1/2.fastq.gz"/]
%% outputs
STATS["statistics.tsv"]
%% metaSPAdes
READS --repair.sh--> METASPADES
METASPADES --> ASSEMBLY["scaffolds.fasta"]
ASSEMBLY --"fasta_to_saf.py"--> SAF["scaffolds.fasta.saf"]
%% Bowtie2
ASSEMBLY --> BOWTIE2_INDEX --> INDEX["scaffolds.fasta.*.bt2"]
READS & INDEX --> BOWTIE2 --> SAMTOOLS --> BAM["mapped.sorted.bam"]
%% featureCounts
BAM & SAF --> FEATURECOUNTS --> COUNTS["counts.tsv"]
ASSEMBLY --> SEQKIT --> STATS
end
```
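
The assembly flowchart above can also be read as a dependency graph. The sketch below is illustrative only and is not part of the commit or of VEBA: it transcribes the diagram's edges into a Python dict (the name `DEPENDS_ON` is an assumption made for this example) and uses the standard-library `graphlib` module (Python 3.9+) to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its value set.
# Node names are copied from the Mermaid labels; the dict itself is hypothetical.
DEPENDS_ON = {
    "metaSPAdes": {"cleaned_1/2.fastq.gz"},
    "scaffolds.fasta": {"metaSPAdes"},
    "scaffolds.fasta.saf": {"scaffolds.fasta"},
    "bowtie2-build": {"scaffolds.fasta"},
    "scaffolds.fasta.*.bt2": {"bowtie2-build"},
    "bowtie2": {"cleaned_1/2.fastq.gz", "scaffolds.fasta.*.bt2"},
    "samtools": {"bowtie2"},
    "mapped.sorted.bam": {"samtools"},
    "featureCounts": {"mapped.sorted.bam", "scaffolds.fasta.saf"},
    "counts.tsv": {"featureCounts"},
    "seqkit stats": {"scaffolds.fasta"},
    "statistics.tsv": {"seqkit stats"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

for step in order:
    print(step)
```

Several orders are valid (for example, `seqkit stats` may run before or after `bowtie2-build`), so the printed sequence is one possibility rather than the order the module actually uses.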
65 changes: 65 additions & 0 deletions images/graphical-abstract/Modules/binning-prokaryotic.md
@@ -0,0 +1,65 @@


```mermaid
%%{init: { "flowchart": { "curve": "linear" } } }%%
%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
graph TD
%% Programs
COVERM["coverm"]
PYRODIGAL["Pyrodigal"]
METABAT2["Metabat2"]
MAXBIN2_107["MaxBin2(MarkerSet=107)"]
MAXBIN2_40["MaxBin2(MarkerSet=40)"]
CONCOCT["CONCOCT"]
DASTOOL["DAS_Tool"]
TIARA["Tiara"]
CHECKM2["CheckM2"]
BARRNAP["barrnap"]
TRNASCANSE["tRNAscan-SE"]
FEATURECOUNTS["featureCounts"]
SEQKIT["seqkit stats"]
%% inputs
ASSEMBLY["scaffolds.fasta"]
BAM["mapped.sorted.bam"]
%% outputs
STATS["statistics.tsv"]
BAM --> COVERM --> COVERAGE["coverage.tsv"]
ASSEMBLY --> PYRODIGAL --> PROTEINS["proteins.fasta"] & CDS["cds.fasta"] & GFF["gene_models.gff"]
subgraph "`**_N_ iterative binning-prokaryotic**`"
ASSEMBLY & COVERAGE --> METABAT2 --> MAGS_METABAT["MAGs<SUB>Metabat2</SUB>"]
ASSEMBLY & COVERAGE --> MAXBIN2_107 --> MAGS_MAXBIN2_107["MAGs<SUB>MaxBin2_107</SUB>"]
ASSEMBLY & COVERAGE --> MAXBIN2_40 --> MAGS_MAXBIN2_40["MAGs<SUB>MaxBin2_40</SUB>"]
ASSEMBLY & COVERAGE --> CONCOCT --> MAGS_CONCOCT["MAGs<SUB>CONCOCT</SUB>"]
MAGS_MAXBIN2_107 & MAGS_MAXBIN2_40 & MAGS_CONCOCT & PROTEINS --> DASTOOL
DASTOOL --> CANDIDATE_MAGS["MAGs<SUB>Candidate</SUB>"]
CANDIDATE_MAGS --> TIARA
TIARA --> MAGS_P["MAGs<SUB>Prokaryotic</SUB>"]
TIARA --x MAGS_E["MAGs<SUB>Eukaryotic</SUB>"]
MAGS_P & PROTEINS --> CHECKM2
CHECKM2 --> MAGS_PASSED["MAGs<SUB>Passed</SUB>"]
CHECKM2 --x MAGS_FAILED["MAGs<SUB>Failed</SUB>"] --> UNBINNED["unbinned.fasta"] --> BEGINNING["Repeat with unbinned.fasta"]
end
MAGS_PASSED --> BARRNAP --> RRNA["MAGS.rRNA.fasta"]
MAGS_PASSED --> TRNASCANSE --> TRNA["MAGS.TRNA.fasta"]
MAGS_PASSED & CDS & RRNA & TRNA --> SEQKIT --> STATS
```
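
The loop inside the "_N_ iterative binning-prokaryotic" subgraph can be sketched as follows. This is a minimal illustration of the control flow only, not VEBA's implementation: `iterative_binning`, `toy_bin_once`, and `toy_passes_qc` are hypothetical names, the binner and quality check are toy stand-ins for the tools named in the diagram, and contigs are plain strings.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Bin = FrozenSet[str]  # a bin is modelled as a set of contig names


def iterative_binning(
    contigs: Iterable[str],
    bin_once: Callable[[Set[str]], List[Bin]],
    passes_qc: Callable[[Bin], bool],
    n_iter: int = 3,
) -> Tuple[List[Bin], Set[str]]:
    """Bin, keep bins that pass QC, then re-bin the leftover contigs."""
    passed: List[Bin] = []
    unbinned: Set[str] = set(contigs)
    for _ in range(n_iter):
        if not unbinned:
            break
        candidates = bin_once(unbinned)
        kept = [b for b in candidates if passes_qc(b)]
        if not kept:
            break  # nothing new recovered, so another round would not help
        passed.extend(kept)
        for b in kept:
            unbinned -= b  # contigs of failed bins stay in the unbinned pool
    return passed, unbinned


def toy_bin_once(pool: Set[str]) -> List[Bin]:
    """Toy binner: group contigs by first letter, return only the largest group."""
    groups = {}
    for contig in sorted(pool):
        groups.setdefault(contig[0], set()).add(contig)
    return [frozenset(max(groups.values(), key=len))]


def toy_passes_qc(candidate: Bin) -> bool:
    """Toy quality check: a bin needs at least two contigs."""
    return len(candidate) >= 2


passed, unbinned = iterative_binning(
    {"a1", "a2", "a3", "b1", "b2", "c1"}, toy_bin_once, toy_passes_qc, n_iter=5
)
print(passed, unbinned)
```

In this toy run the second iteration recovers a bin that the first pass did not return, which is the behaviour the "Repeat with unbinned.fasta" edge in the diagram describes.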
114 changes: 114 additions & 0 deletions images/graphical-abstract/Modules/mermaid_test.md
@@ -0,0 +1,114 @@


```mermaid
%%{init: { "flowchart": { "curve": "linear" } } }%%
%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
graph TD
subgraph "`**preprocessing**`"
%% modules
PREPROCESS_SHORT(["`_preprocess-short_`"])
PREPROCESS_LONG(["`_preprocess-long_`"])
%% inputs
R1[\"Illumina_1.fastq.gz"/]
R2[\"Illumina_2.fastq.gz"/]
LONG[\"ONT|PacBio.fastq.gz"/]
%% databases
CONTAMINATION[(Contamination)]
KMER[(K-mer Profiles)]
%% ---
%% preprocess/-long
R1 & R2 --> PREPROCESS_SHORT
CONTAMINATION -.-> PREPROCESS_SHORT
KMER -.-> PREPROCESS_SHORT
LONG --> PREPROCESS_LONG
CONTAMINATION -.-> PREPROCESS_LONG
KMER -.-> PREPROCESS_LONG
end
subgraph "`**assembly**`"
%%inputs
ASSEMBLY(["`_assembly|assembly-long_`"])
%% outputs
ASSEMBLY_FASTA[["assembly.fasta"]]
BAM[["mapped.sorted.bam"]]
%% assembly/-long
PREPROCESS_SHORT --cleaned_1/2.fastq.gz--> ASSEMBLY
PREPROCESS_LONG --cleaned.fastq.gz--> ASSEMBLY
ASSEMBLY --> ASSEMBLY_FASTA & BAM
end
%% --
subgraph "`**binning**`"
%% modules
BINNING_VIRAL(["`_binning-viral_`"])
BINNING_PROKARYOTIC(["`_binning-prokaryotic_`"])
BINNING_EUKARYOTIC(["`_binning-eukaryotic_`"])
%% outputs
GENOMES_AND_GENE_MODELS("Genomes & Gene Models")
GENOMES[["Genomes"]]
GENE_MODELS[["Gene Models"]]
%% databases
%%CHECKV[("CheckV")]--> BINNING_VIRAL
%%GENOMAD[("geNomad")]--> BINNING_VIRAL
%% --
%% binning-viral
ASSEMBLY_FASTA & BAM --> BINNING_VIRAL
%% binning-prokaryotic
BINNING_VIRAL --unbinned.fasta--> BINNING_PROKARYOTIC
BAM --> BINNING_PROKARYOTIC
%% binning-eukaryotic
BINNING_PROKARYOTIC --unbinned.fasta--> BINNING_EUKARYOTIC
BAM --> BINNING_EUKARYOTIC
%% coverage
%% COVERAGE("coverage|coverage-long")
BINNING_VIRAL & BINNING_PROKARYOTIC & BINNING_EUKARYOTIC --"genome-resolved"--> GENOMES_AND_GENE_MODELS
GENOMES_AND_GENE_MODELS --> GENOMES & GENE_MODELS
end
%% --
subgraph "`**clustering**`"
%% modules
CLUSTER("`_cluster_`")
%% output
PROTEIN_CLUSTERS[["SLC-specific Protein Clusters (SSPC)"]]
GENOME_CLUSTERS[["Species-level Clusters (SLC)"]]
%% cluster
GENOMES & GENE_MODELS--> CLUSTER
CLUSTER --> GENOME_CLUSTERS
CLUSTER --> PROTEIN_CLUSTERS
end
subgraph "`**annotation**`"
ANNOTATE("`_annotate_`")
GENE_MODELS & PROTEIN_CLUSTERS --> ANNOTATE
end
```
53 changes: 53 additions & 0 deletions images/graphical-abstract/Modules/preprocess.md
@@ -0,0 +1,53 @@


```mermaid
%%{init: { "flowchart": { "curve": "linear" } } }%%
%% Available curve styles include basis, bumpX, bumpY, cardinal, catmullRom, linear, monotoneX, monotoneY, natural, step, stepAfter, and stepBefore. %%%
graph LR
subgraph "`**preprocess**`"
%% Programs
FASTP["FastP"]
BOWTIE2["Bowtie2"]
SEQKIT["seqkit stats"]
BBDUK["BBDuk"]
%% Databases
CONTAMINATION[("Contamination")]
KMERS[("K-mer Profiles")]
%% inputs
READS[\"Illumina_1/2.fastq.gz"/]
%% outputs
STATS["statistics.tsv"]
%% FastP
READS --> FASTP
FASTP --"trimmed_1/2.fastq.gz"--> BOWTIE2
%% Bowtie2
CONTAMINATION --> BOWTIE2
BOWTIE2 --"cleaned_1/2.fastq.gz"--> BBDUK
BOWTIE2 --"contaminated_1/2.fastq.gz"--> STATS
%%BBDuk
KMERS --> BBDUK
READS --> SEQKIT
BOWTIE2 --> SEQKIT
BBDUK --"cleaned_1/2.non-kmer_hits.fastq.gz"--> SEQKIT
BBDUK --"cleaned_1/2.kmer_hits.fastq.gz"--> SEQKIT
SEQKIT --> STATS
end
```
Binary file added images/graphical-abstract/graphical-abstract.pdf
Binary file not shown.
Binary file added images/graphical-abstract/graphical-abstract.png
8 changes: 4 additions & 4 deletions install/DATABASE.md
@@ -228,7 +228,7 @@ VEBA’s Microeukaryotic Protein Database has been completely redesigned using t
**Deprecated:**

<details>
-<summary> VEBA Database* version: VDB_v5.2 (243 GB) </summary>
+<summary> VEBA Database version: VDB_v5.2 (243 GB) </summary>

* Added `MicrobeAnnotator-KEGG` [Zenodo: 10020074](https://zenodo.org/records/10020074) which includes KEGG module pathway information from [`MicrobeAnnotator`](https://doi.org/10.1186/s12859-020-03940-5).
* Added `CAZy` protein sequences from [`dbCAN2`](https://academic.oup.com/nar/article/46/W1/W95/4996582)
@@ -823,7 +823,7 @@ tree -L 3 .


<details>
-<summary>VEBA Database* version: VDB_v3.1</summary>
+<summary>VEBA Database version: VDB_v3.1</summary>

The same as `VDB_v3` but updates `VDB-Microeukaryotic_v2` to `VDB-Microeukaryotic_v2.1` which has a `reference.eukaryota_odb10.list` containing only the subset of identifiers that are core eukaryotic markers (useful for classification).

@@ -933,7 +933,7 @@ tree -L 3 .


<details>
-<summary>VEBA Database* version: VDB_v3</summary>
+<summary>VEBA Database version: VDB_v3</summary>

```
tree -L 3 .
@@ -1031,7 +1031,7 @@ tree -L 3 .


<details>
-<summary>VEBA Database* version: VDB_v2</summary>
+<summary>VEBA Database version: VDB_v2</summary>

* Compatible with *VEBA* version: `v1.0.2a+`
