Merge pull request #39 from jolespin/devel
v1.4.1
jolespin authored Dec 19, 2023
2 parents f2e1ebd + b76259f commit 960d7de
Showing 10 changed files with 220 additions and 50 deletions.
32 changes: 1 addition & 31 deletions CHANGELOG.md
@@ -6,7 +6,7 @@ ________________________________________________________________

#### Current Releases:

**Release v1.4.0 Highlights:**
**Release v1.4.1 Highlights:**

* **`VEBA` Modules:**

@@ -21,36 +21,6 @@ ________________________________________________________________

* Completely rebuilt `VEBA's Microeukaryotic Protein Database` to produce a clustered database `MicroEuk100/90/50` similar to `UniRef100/90/50`. Available on [doi:10.5281/zenodo.10139450](https://zenodo.org/records/10139451).

* **Number of sequences:**

* MicroEuk100 = 79,920,431 (19 GB)
* MicroEuk90 = 51,767,730 (13 GB)
* MicroEuk50 = 29,898,853 (6.5 GB)



* **Number of source organisms per dataset:**

* MycoCosm = 2503
* PhycoCosm = 174
* EnsemblProtists = 233
* MMETSP = 759
* TARA_SAGv1 = 8
* EukProt = 366
* EukZoo = 27
* TARA_SMAGv1 = 389
* NR_Protists-Fungi = 48217

<details>
<summary>**Release v1.4.0 Details**</summary>
* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
4 changes: 2 additions & 2 deletions README.md
@@ -42,7 +42,7 @@ ___________________________________________________________________

### Announcements

* **What's new in `VEBA v1.4.0`?**
* **What's new in `VEBA v1.4.1`?**

* **`VEBA` Modules:**

@@ -67,7 +67,7 @@ ___________________________________________________________________

### Installation and databases

**Current Stable Version:** [`v1.4.0`](https://github.com/jolespin/veba/releases/tag/v1.4.0)
**Current Stable Version:** [`v1.4.1`](https://github.com/jolespin/veba/releases/tag/v1.4.1)

**Current Database Version:** `VDB_v6`

2 changes: 1 addition & 1 deletion VERSION
@@ -1,2 +1,2 @@
1.4.0b
1.4.1
VDB_v6
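
The `VERSION` file in this hunk holds two lines: the software version followed by the database version. A minimal Python sketch of reading both values, assuming that two-line layout holds; the helper name `read_version_file` is hypothetical and not part of `VEBA`:

```python
from pathlib import Path
import tempfile


def read_version_file(path):
    """Return (software_version, database_version) from a two-line VERSION file."""
    lines = [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"Expected 2 non-empty lines, found {len(lines)}")
    return lines[0], lines[1]


# Demo on a temporary file that mirrors the post-commit contents shown above.
with tempfile.TemporaryDirectory() as tmp:
    version_path = Path(tmp) / "VERSION"
    version_path.write_text("1.4.1\nVDB_v6\n")
    software_version, database_version = read_version_file(version_path)
    print(software_version, database_version)
```

Raising on an unexpected line count is a design choice for the sketch; it makes a malformed file fail early instead of silently returning a partial result.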
1 change: 1 addition & 0 deletions images/Schematic/Schematic_v1.1.x.gslides
@@ -0,0 +1 @@
{"":"WARNING! DO NOT EDIT THIS FILE! ANY CHANGES MADE WILL BE LOST!","doc_id":"1L0LdxYJxvgSgINjKZXaOJS9UKtFbf_RC8lYxAFUhCTw","resource_key":"","email":"[email protected]"}
1 change: 1 addition & 0 deletions images/Schematic/Schematic_v1.2.0.gslides
@@ -0,0 +1 @@
{"":"WARNING! DO NOT EDIT THIS FILE! ANY CHANGES MADE WILL BE LOST!","doc_id":"1WzXffcWcl84a__OQHP5Qx0jeM50b2HCZZMT9vzNtvQw","resource_key":"","email":"[email protected]"}
212 changes: 207 additions & 5 deletions install/DATABASE.md
@@ -21,20 +21,218 @@ Please cite the following sources if these marker sets are used in any way:
Espinoza, Josh (2022): Profile HMM marker sets. figshare. Dataset. https://doi.org/10.6084/m9.figshare.19616016.v1

#### Microeukaryotic protein database:
A protein database is required not only for eukaryotic gene calls using MetaEuk but can also be used for MAG annotation. Many eukaryotic protein databases exist such as MMETSP, EukZoo, and EukProt, yet these are limited to marine environments, include prokaryotic sequences, or include eukaryotic sequences for organisms that would not be expected to be binned out of metagenomes such as metazoans. We combined and dereplicated MMETSP, EukZoo, EukProt, and NCBI non-redundant to include only microeukaryotes such as protists and fungi. This optimized microeukaryotic database ensures that only eukaryotic exons expected to be represented in metagenomes are utilized for eukaryotic gene modeling and the resulting MetaEuk reference targets are used for eukaryotic MAG classification. VEBA’s microeukaryotic protein database includes 48,006,918 proteins from 42,922 microeukaryotic strains.
VEBA’s Microeukaryotic Protein Database has been completely redesigned using the logic of UniRef and its clustered databases. The previous microeukaryotic protein database contained 48,006,918 proteins from 44,647 source organisms while the updated database, MicroEuk, contains 79,920,430 proteins from 52,495 source organisms. As in the prior major release, MicroEuk concentrates on microeukaryotic organisms, which are the primary eukaryotes targeted by shotgun metagenomics and metatranscriptomics, and excludes higher eukaryotes. Source organisms in this context are defined as the organisms from which the proteins were derived.

**Number of sequences:**

* MicroEuk100 = 79,920,431 (19 GB)
* MicroEuk90 = 51,767,730 (13 GB)
* MicroEuk50 = 29,898,853 (6.5 GB)



**Number of source organisms per dataset:**

* MycoCosm = 2503
* PhycoCosm = 174
* EnsemblProtists = 233
* MMETSP = 759
* TARA_SAGv1 = 8
* EukProt = 366
* EukZoo = 27
* TARA_SMAGv1 = 389
* NR_Protists-Fungi = 48217

**Current:**

* [VDB-Microeukaryotic\_v2.1](https://zenodo.org/record/7485114) available on Zenodo
* [MicroEuk\_v3](https://zenodo.org/records/10139451) available on Zenodo

**Deprecated:**

* [VDB-Microeukaryotic\_v1](https://figshare.com/articles/dataset/Microeukaryotic_Protein_Database/19668855) available on FigShare
* [MicroEuk\_v2](https://zenodo.org/record/7485114) available on Zenodo

* [MicroEuk\_v1](https://figshare.com/articles/dataset/Microeukaryotic_Protein_Database/19668855) available on FigShare

#### Database Structure:

**Current:**
*VEBA Database* version: `VDB_v5.2` (243 GB)

*VEBA Database* version: `VDB_v6` (272 GB)

* Added `MicroEuk_v3`

```
tree -L 3 .
.
├── ACCESS_DATE
├── Annotate
│   ├── CAZy
│   │   └── CAZyDB.07262023.dmnd
│   ├── KOFAM
│   │   ├── ko_list
│   │   └── profiles
│   ├── MIBiG
│   │   └── mibig_v3.1.dmnd
│   ├── MicrobeAnnotator-KEGG
│   │   ├── KEGG_Bifurcating_Module_Information.pkl
│   │   ├── KEGG_Bifurcating_Module_Information.pkl.md5
│   │   ├── KEGG_Module_Information.txt
│   │   ├── KEGG_Module_Information.txt.md5
│   │   ├── KEGG_Regular_Module_Information.pkl
│   │   ├── KEGG_Regular_Module_Information.pkl.md5
│   │   ├── KEGG_Structural_Module_Information.pkl
│   │   └── KEGG_Structural_Module_Information.pkl.md5
│   ├── MicrobeAnnotator-KEGG.tar.gz
│   ├── NCBIfam-AMRFinder
│   │   ├── NCBIfam-AMRFinder.changelog.txt
│   │   ├── NCBIfam-AMRFinder.hmm.gz
│   │   └── NCBIfam-AMRFinder.tsv
│   ├── Pfam
│   │   ├── Pfam-A.hmm.gz
│   │   └── relnotes.txt
│   ├── UniRef
│   │   ├── uniref50.dmnd
│   │   ├── uniref50.release_note
│   │   ├── uniref90.dmnd
│   │   └── uniref90.release_note
│   └── VFDB
│   └── VFDB_setA_pro.dmnd
├── Classify
│   ├── CheckM2
│   │   └── uniref100.KO.1.dmnd
│   ├── CheckV
│   │   ├── genome_db
│   │   ├── hmm_db
│   │   └── README.txt
│   ├── geNomad
│   │   ├── genomad_db
│   │   ├── genomad_db.dbtype
│   │   ├── genomad_db_h
│   │   ├── genomad_db_h.dbtype
│   │   ├── genomad_db_h.index
│   │   ├── genomad_db.index
│   │   ├── genomad_db.lookup
│   │   ├── genomad_db_mapping
│   │   ├── genomad_db.source
│   │   ├── genomad_db_taxonomy
│   │   ├── genomad_integrase_db
│   │   ├── genomad_integrase_db.dbtype
│   │   ├── genomad_integrase_db_h
│   │   ├── genomad_integrase_db_h.dbtype
│   │   ├── genomad_integrase_db_h.index
│   │   ├── genomad_integrase_db.index
│   │   ├── genomad_integrase_db.lookup
│   │   ├── genomad_integrase_db.source
│   │   ├── genomad_marker_metadata.tsv
│   │   ├── genomad_mini_db -> genomad_db
│   │   ├── genomad_mini_db.dbtype
│   │   ├── genomad_mini_db_h -> genomad_db_h
│   │   ├── genomad_mini_db_h.dbtype -> genomad_db_h.dbtype
│   │   ├── genomad_mini_db_h.index -> genomad_db_h.index
│   │   ├── genomad_mini_db.index
│   │   ├── genomad_mini_db.lookup -> genomad_db.lookup
│   │   ├── genomad_mini_db_mapping -> genomad_db_mapping
│   │   ├── genomad_mini_db.source -> genomad_db.source
│   │   ├── genomad_mini_db_taxonomy -> genomad_db_taxonomy
│   │   ├── mini_set_ids
│   │   ├── names.dmp
│   │   ├── nodes.dmp
│   │   ├── plasmid_hallmark_annotation.txt
│   │   ├── version.txt
│   │   └── virus_hallmark_annotation.txt
│   ├── GTDB
│   │   ├── fastani
│   │   ├── markers
│   │   ├── mash
│   │   ├── masks
│   │   ├── metadata
│   │   ├── mrca_red
│   │   ├── msa
│   │   ├── pplacer
│   │   ├── radii
│   │   ├── split
│   │   ├── taxonomy
│   │   └── temp
│   ├── MicroEuk
│   │   ├── MicroEuk100
│   │   ├── MicroEuk100.dbtype
│   │   ├── MicroEuk100.eukaryota_odb10
│   │   ├── MicroEuk100.eukaryota_odb10.dbtype
│   │   ├── MicroEuk100.eukaryota_odb10_h
│   │   ├── MicroEuk100.eukaryota_odb10_h.dbtype
│   │   ├── MicroEuk100.eukaryota_odb10_h.index
│   │   ├── MicroEuk100.eukaryota_odb10.index
│   │   ├── MicroEuk100.eukaryota_odb10.lookup
│   │   ├── MicroEuk100.eukaryota_odb10.source
│   │   ├── MicroEuk100_h
│   │   ├── MicroEuk100_h.dbtype
│   │   ├── MicroEuk100_h.index
│   │   ├── MicroEuk100.index
│   │   ├── MicroEuk100.lookup
│   │   ├── MicroEuk100_mapping
│   │   ├── MicroEuk100.source
│   │   ├── MicroEuk100_taxonomy
│   │   ├── MicroEuk50
│   │   ├── MicroEuk50.dbtype
│   │   ├── MicroEuk50_h
│   │   ├── MicroEuk50_h.dbtype
│   │   ├── MicroEuk50_h.index
│   │   ├── MicroEuk50.index
│   │   ├── MicroEuk50.lookup
│   │   ├── MicroEuk50.source
│   │   ├── MicroEuk90
│   │   ├── MicroEuk90.dbtype
│   │   ├── MicroEuk90_h
│   │   ├── MicroEuk90_h.dbtype
│   │   ├── MicroEuk90_h.index
│   │   ├── MicroEuk90.index
│   │   ├── MicroEuk90.lookup
│   │   ├── MicroEuk90.source
│   │   ├── source_taxonomy.tsv.gz
│   │   ├── source_to_lineage.dict.pkl.gz
│   │   └── target_to_source.dict.pkl.gz
│   └── NCBITaxonomy
│   ├── citations.dmp
│   ├── delnodes.dmp
│   ├── division.dmp
│   ├── gc.prt
│   ├── gencode.dmp
│   ├── merged.dmp
│   ├── names.dmp
│   ├── nodes.dmp
│   └── readme.txt
├── Contamination
│   ├── AntiFam
│   │   ├── AntiFam.hmm.gz
│   │   ├── relnotes
│   │   └── version
│   ├── chm13v2.0
│   │   ├── chm13v2.0.1.bt2
│   │   ├── chm13v2.0.2.bt2
│   │   ├── chm13v2.0.3.bt2
│   │   ├── chm13v2.0.4.bt2
│   │   ├── chm13v2.0.rev.1.bt2
│   │   └── chm13v2.0.rev.2.bt2
│   └── kmers
│   └── ribokmers.fa.gz
└── MarkerSets
├── Archaea_76.hmm.gz
├── Bacteria_71.hmm.gz
├── CPR_43.hmm.gz
├── eukaryota_odb10.hmm.gz
├── eukaryota_odb10.scores_cutoff.tsv.gz
├── Fungi_593.hmm.gz
├── Protista_83.hmm.gz
└── README
36 directories, 124 files
```
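
A minimal Python sketch, assuming the layout in the tree above, of checking that a few of the listed `MicroEuk` files exist under a database directory. The function name `missing_paths`, the short `REQUIRED` list, and the temporary directory are illustrative and not part of `VEBA`:

```python
from pathlib import Path
import tempfile

# A small, illustrative subset of the paths shown in the tree listing.
REQUIRED = [
    "Classify/MicroEuk/MicroEuk100",
    "Classify/MicroEuk/MicroEuk90",
    "Classify/MicroEuk/MicroEuk50",
    "Classify/MicroEuk/MicroEuk100.eukaryota_odb10",
]


def missing_paths(database_directory, required=REQUIRED):
    """Return the required relative paths that do not exist under the directory."""
    root = Path(database_directory)
    return [rel for rel in required if not (root / rel).exists()]


# Demo: build a toy directory that lacks the eukaryota_odb10 subset database.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Classify" / "MicroEuk"
    target.mkdir(parents=True)
    for name in ("MicroEuk100", "MicroEuk90", "MicroEuk50"):
        (target / name).touch()
    missing = missing_paths(tmp)
    print(missing)
```

In the toy directory only the `MicroEuk100.eukaryota_odb10` entry is reported, since that is the one file the demo does not create.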

**Deprecated:**

<details>
<summary> *VEBA Database* version: `VDB_v5.2` (243 GB) </summary>

* Added `MicrobeAnnotator-KEGG` [Zenodo: 10020074](https://zenodo.org/records/10020074) which includes KEGG module pathway information from [`MicrobeAnnotator`](https://doi.org/10.1186/s12859-020-03940-5).
* Added `CAZy` protein sequences from [`dbCAN2`](https://academic.oup.com/nar/article/46/W1/W95/4996582)
@@ -194,7 +392,7 @@ tree -L 3 .
37 directories, 112 files
```

**Deprecated:**
</details>

<details>
<summary> *VEBA Database* version: `VDB_v5.1` </summary>
@@ -340,6 +538,7 @@ tree -L 3 .
├── Protista_83.hmm.gz
└── README
```

</details>

<details>
@@ -481,6 +680,7 @@ tree -L 3 .
├── Protista_83.hmm.gz
└── README
```

</details>

<details>
@@ -622,6 +822,7 @@ tree -L 3 .
31 directories, 96 files
```

</details>


@@ -731,6 +932,7 @@ tree -L 3 .
35 directories, 60 files
```

</details>


2 changes: 1 addition & 1 deletion install/README.md
@@ -85,7 +85,7 @@ The `VEBA` installation is going to configure some `conda` environments for you
```
# For stable version, download and decompress the tarball:
VERSION="1.4.0"
VERSION="1.4.1"
wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz
tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
8 changes: 2 additions & 6 deletions install/download_databases.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# __version__ = "2023.12.11"
# __version__ = "2023.12.19"
# VEBA_DATABASE_VERSION = "VDB_v6"
# MICROEUKAYROTIC_DATABASE_VERSION = "MicroEuk_v3"

@@ -114,11 +114,7 @@ mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa

# MicroEuk100.eukaryota_odb10
gzip -d ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list.gz
seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100

# MicroEuk90
gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list
seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk90
seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100.eukaryota_odb10

# MicroEuk90
gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list
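
The `cut -f1 | sort -u` step above collects the unique values in the first column of the cluster table. A minimal Python sketch of the same operation, assuming the table is tab-separated with the cluster representative in the first column and a cluster member in the second; the identifiers in the toy table are made up:

```python
import gzip
import io


def representatives(handle):
    """Return the sorted unique first-column values of a tab-separated cluster table."""
    reps = set()
    for line in handle:
        line = line.rstrip("\n")
        if line:
            reps.add(line.split("\t")[0])
    return sorted(reps)


# Toy gzip-compressed table standing in for MicroEuk90_clusters.tsv.gz.
toy = "repA\tseq1\nrepA\tseq2\nrepB\tseq3\n"
compressed = gzip.compress(toy.encode())
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    rep_ids = representatives(handle)
print(rep_ids)
```

Python's `sorted` orders strings by code point, so the order can differ from `sort -u` under a non-C locale; the set of identifiers is the same either way.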
2 changes: 1 addition & 1 deletion walkthroughs/adapting_commands_for_aws.md
@@ -38,7 +38,7 @@ This job definition pulls the [jolespin/veba_preprocess](https://hub.docker.com/
"jobDefinitionName": "preprocess__S1",
"type": "container",
"containerProperties": {
"image": "jolespin/veba_preprocess:1.4.0",
"image": "jolespin/veba_preprocess:1.4.1",
"command": [
"preprocess.py",
"-1",