Commit: v1.1.0

jolespin committed Mar 2, 2023
1 parent d37034d commit 6a420d7

Showing 7 changed files with 58 additions and 61 deletions.
27 changes: 15 additions & 12 deletions DEVELOPMENT.md
@@ -5,7 +5,7 @@
________________________________________________________________

#### Current Releases:
##### Release v1.1.0 (Currently testing before official release)
##### Release v1.1.0

* **Modules**:
* `annotate.py`
@@ -21,12 +21,14 @@ ________________________________________________________________
* Uses `binning_wrapper.py` for all binning. This makes it easier to add new binning algorithms in the future (e.g., `VAMB`). Also, check out the new multi-split binning functionality described below.
* Added `--skip_concoct` in addition to the existing `--skip_maxbin2` option, as `MaxBin2` takes very long when there are many contigs and `CONCOCT` takes a long time when there are many samples (i.e., BAM files). `MetaBAT2` is not optional.
* `binning-viral.py`
* Complete rewrite of this module which now uses `geNomad` as the default binning algorithm but still supports `VirFinder`.
* If `VirFinder` is used, `genomad annotate` is run via the `genomad_taxonomy_wrapper.py` script included in the update.
* Updated `Prodigal` → `Prodigal-GV` to handle additional viral genetic codes.
* `biosynthetic.py`
* Introduces `component_id` and `bgc_id` which are unique, parseable, and informative. For example, `component_id = SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+)` contains the unique `bgc_id` (i.e., `SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001`), shows that it is the 1st gene in the cluster (the `_1` in `region001_1`), and gives the gene start/end/strand. The `bgc_id` is composed of the `genome_id|contig_id|region_id` (a parsing sketch follows this module list).
* `classify-prokaryotic.py`
* Updated `GTDB-Tk v2.1.1` → `GTDB-Tk v2.2.3`. For now, `--skip_ani_screen` is the only option because of [this thread](https://forum.gtdb.ecogenomic.org/t/how-can-i-use-the-mash-db-option-of-classify-wf/429/4). However, `--mash_db` may be an option in the near future.
* Added functionality to classify prokaryotic genomes that were not binned via `VEBA`, available with the `--genomes` option (`--prokaryotic_binning_directory` is still available and can leverage existing intermediate files).
@@ -39,14 +41,18 @@ ________________________________________________________________
* Added functionality to classify viral genomes that were not binned via `VEBA`, available with the `--genomes` option (`--viral_binning_directory` is still available and can leverage existing intermediate files).

* `cluster.py`
* Complete rewrite of this module which now uses `MMSEQS2` as the orthogroup detection algorithm instead of `OrthoFinder`. `OrthoFinder` is overkill for creating protein clusters and it generates thousands of intermediate files (e.g., fasta, alignments, trees, etc.) which substantially increase the compute time. `MMSEQS2` has very similar performance with a fraction of the resources and compute time. Clustered the entire [*Plastisphere* dataset](https://figshare.com/articles/dataset/Genome_assemblies_gene_models_gene_annotations_taxonomy_classifications_clusters_and_counts_tables/20263974) on a local machine in ~30 minutes compared to several days on an HPC.
* Now that the resources are minimal, clustering is performed at the global level as before (i.e., all samples in the dataset) and now also at the local level (optional but ON by default), which clusters all genomes within a sample. Accompanying wrapper scripts are `global_clustering.py` and `local_clustering.py`.
* The genomic and functional feature compression ratios (FCR) (described [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04973-8)) are now calculated automatically. The calculation is `1 - number_of_clusters/number_of_features`, which can easily be converted into an unsupervised biodiversity metric (see the FCR sketch after this module list). This is calculated at the global (original implementation) and local levels.
* Input is now a table with the following columns: `[organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]`, generated easily with the `compile_genomes_table.py` script (see the loading sketch after this module list). This allows clustering to be performed for prokaryotes, eukaryotes, and viruses all at the same time.
* SLC-specific orthogroups (SSO) are now referred to as SLC-specific protein clusters (SSPC).
* Supports zfilling (e.g., `zfill=3`, `SLC7 → SLC007`) for genomic and protein cluster identifiers.
* Deprecated `fastani_to_clusters.py` in favor of the more generalizable `edgelist_to_clusters.py`, which is used for both genomic and protein clusters. This also outputs a `NetworkX` graph and a pickled dictionary `{"cluster_a":{"component_1", "component_2", ..., "component_n"}}` (see the clustering sketch after this module list).
* `phylogeny.py`
* Updated `MUSCLE` to `v5`, which has the `-align` and `-super5` algorithms, now accessible with `--alignment_algorithm`. `MUSCLE v5` cannot read from `stdin`, so the fasta files are no longer gzipped. `merge_msa.py` now outputs uncompressed fasta by default and can output gzipped fasta with the `--gzip` flag.
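
Since the `component_id`/`bgc_id` scheme in `biosynthetic.py` is parseable by design, here is a minimal sketch of how the fields can be unpacked, using the example identifier above; this is an illustration only, not a helper shipped with `VEBA`.

```python
# Example component_id from the biosynthetic.py notes above
component_id = "SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+)"

# Layout: genome_id|contig_id|<region_id>_<gene_index>|<start>-<end>(<strand>)
genome_id, contig_id, region_gene, position = component_id.split("|")
region_id, gene_index = region_gene.rsplit("_", 1)    # "region001", "1" (1st gene in the cluster)
bgc_id = "|".join([genome_id, contig_id, region_id])  # bgc_id = genome_id|contig_id|region_id
strand = position[-2]                                 # "+"
start, end = map(int, position[:-3].split("-"))       # 2, 2681
```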
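
To make the FCR formula concrete, a minimal sketch (the counts below are hypothetical):

```python
def feature_compression_ratio(number_of_clusters: int, number_of_features: int) -> float:
    """FCR = 1 - number_of_clusters/number_of_features.

    Higher values mean more redundancy was collapsed into clusters, which is
    why the ratio can be read as an unsupervised biodiversity metric at the
    global (all samples) or local (within-sample) level.
    """
    return 1 - (number_of_clusters / number_of_features)

# Hypothetical example: 650 protein clusters from 1,000 proteins
print(feature_compression_ratio(650, 1000))  # ~0.35
```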
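
A hedged sketch of loading the clustering input table with `pandas`; the filename is hypothetical, and the absence of a header row is an assumption:

```python
import pandas as pd

# Columns as described above:
# [organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]
df_genomes = pd.read_csv(
    "genomes_table.tsv",  # hypothetical path, e.g., output of compile_genomes_table.py
    sep="\t",
    header=None,
    names=["organism_type", "id_sample", "id_mag", "genome", "proteins"],
)
```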
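
Finally, a sketch of the idea behind `edgelist_to_clusters.py`: connected components of an edge list become clusters, serialized as a `NetworkX` graph plus a pickled dictionary. The edges, cluster labels, and output paths here are hypothetical, not the script's actual interface.

```python
import pickle
import networkx as nx

# Hypothetical edge list, e.g., genome pairs passing an ANI threshold
edges = [("genome_1", "genome_2"), ("genome_2", "genome_3"), ("genome_4", "genome_5")]
graph = nx.Graph(edges)

# One cluster per connected component; zfill=3 gives SLC001, SLC002, ...
clusters = {
    "SLC{}".format(str(i).zfill(3)): set(component)
    for i, component in enumerate(nx.connected_components(graph), start=1)
}
# → {"SLC001": {"genome_1", "genome_2", "genome_3"}, "SLC002": {"genome_4", "genome_5"}}

with open("clusters.dict.pkl", "wb") as f:  # hypothetical output path
    pickle.dump(clusters, f)
```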

* **`VEBA Database`**:
* `VDB_v3.1` → `VDB_v4`
* Updated `CheckV DB v1.0` → `CheckV DB v1.5`
@@ -57,7 +63,7 @@ ________________________________________________________________
* Added `reference.eukaryota_odb10.list` and corresponding `MMSEQS2` database (i.e., `microeukaryotic.eukaryota_odb10`)
* Added `NCBIfam-AMRFinder` marker set for annotation
* Added `AntiFam` marker set for contamination
* Marker set HMMs are now all gzipped (previously they could not be gzipped because of the `CheckM` CPR workflow)

* **Scripts:**
* Added:
@@ -89,6 +95,7 @@ ________________________________________________________________
* `subset_table.py` - Added option to set index column and to drop duplicates.
* `virfinder_wrapper.r` - Used to be `VirFinder_wrapper.R`. This now has an option to use FDR values instead of P values.
* `merge_annotations_and_score_taxonomy.py` - Completely rewritten. Uses `taxopy` instead of `ete3`.
* `merge_msa.py` - Output uncompressed protein fasta files by default and can compress with `--gzip` flag.

* Deprecated:
* `adjust_genomes_for_cpr.py`
@@ -162,7 +169,7 @@ ___
________________________________________________________________


#### Future Releases:
#### Path to `v2.0.0`:

**Definitely:**

@@ -172,28 +179,24 @@ ________________________________________________________________
* Automated consensus protein cluster annotations. First need to create a hierarchical naming scheme that uses NR > KOFAM > Pfam.
* Create a wrapper around `hmmsearch` that takes in score cutoffs and outputs a useable table. This will be used in place of `KOFAMSCAN` which creates thousands of intermediate files.
* Add MAG-level counts to prokaryotic and eukaryotic binning. Add an optional BAM file for viral binning; if provided, add MAG-level counts as well.
* Support genome table input for `biosynthetic.py`, `phylogeny.py`, `index.py`, etc.
* Install each module via `bioconda`

**Probably (Yes)?:**
* Add a `metabolic.py` module
* Swap [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) for [`TransSuite`](https://github.com/anonconda/TranSuite)
* Add support for `Anvi'o` object export in `cluster.py`
* Add spatial coverage to the `coverage.py` script like in the `mapping.py` module? Maybe just the `samtools coverage` output.


**...Maybe (Not)?**

* Add `VAMB` as an option for `binning-prokaryotic.py` (requires python >= 3.7,<3.8)
* Add an option for sample name prefix in `assembly.py`

________________________________________________________________

#### Change Log:
* [2023.2.23] - The largest update to date. Please refer to v1.1 for details on what has been changed.
* [2023.01.20] - Changed `-a --ani` to `-t --threshold` in `fastani_to_clusters.py` to match the usage in `edgelist_to_clusters.py` which is a generalization of `fastani_to_clusters.py` developed for `MMSEQS2` and `Diamond` implementations.
9 changes: 7 additions & 2 deletions install/README.md
@@ -93,15 +93,20 @@ cd veba/install

**2. Install VEBA environments**

**Recommended resource allocation:** 4 hours with 15 GB memory (include extra time for variable I/O speed on various hosts)

For `v1.1.0`, **this should take ~1.75 hours (~108 minutes) with ~15 GB memory allocated**. The update from `CheckM1` -> `CheckM2` and the installation of `antiSMASH` require more memory and may require grid access if the head node is limited.

```
bash install_veba.sh
```

**3. Activate the database conda environment, download, and configure databases**

**Recommended resource allocation:** 16 hours with 64 GB memory (include extra time for variable I/O speed on various hosts)

⚠️ **This step should take ~11 hrs with 64 GB memory** and should be run using a compute grid via SLURM or SunGridEngine. If this command is run on the head node, it will likely fail or time out if the connection is interrupted. The most computationally intensive steps are creating a `Diamond` database of NCBI's non-redundant reference and a `MMSEQS2` database of the microeukaryotic protein database. Note that the duration will depend on several factors, including your internet connection speed and the I/O of public repositories.

If issues arise, please [submit a GitHub issue](https://github.com/jolespin/veba/issues) prefixed with `[Database]`. We are here to help :)

25 changes: 15 additions & 10 deletions install/update_environment_variables.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# __VERSION__ = "2022.12.27"
# __VERSION__ = "2023.3.1"

# Create database
DATABASE_DIRECTORY=${1:-"."}
@@ -10,7 +10,6 @@ CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
echo ". .. ... ..... ........ ............."
echo "i * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=${REALPATH_DATABASE_DIRECTORY}"


# VEBA
for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do
echo $ENV_PREFIX;
@@ -20,23 +19,29 @@ for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do
echo "unset VEBA_DATABASE" > ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

#GTDB-Tk/CheckM
#CheckM2
echo ". .. ... ..... ........ ............."
echo "xiii * Adding the following environment variable to VEBA environments: export CHECKM2DB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM2/uniref100.KO.1.dmnd"
for ENV_NAME in VEBA-binning-prokaryotic_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
# CheckM2
echo "export CHECKM2DB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM2/uniref100.KO.1.dmnd" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset CHECKM2DB" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

#GTDB-Tk
echo ". .. ... ..... ........ ............."
echo "ii * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/"
for ENV_NAME in VEBA-binning-prokaryotic_env VEBA-classify_env; do
echo "xiv * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/"
for ENV_NAME in VEBA-classify_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
# GTDB-Tk
# GTDBTK_DATABASE_VERSION=$(ls ${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk)
echo "export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset GTDBTK_DATA_PATH" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
# CheckM
echo "export CHECKM_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM/" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset CHECKM_DATA_PATH" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

# CheckV
echo ". .. ... ..... ........ ............."
echo "iii * Adding the following environment variable to VEBA environments: export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV/"
echo "xv * Adding the following environment variable to VEBA environments: export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV/"
for ENV_NAME in VEBA-binning-viral_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
echo "export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
4 changes: 2 additions & 2 deletions src/biosynthetic.py
@@ -12,7 +12,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.2.1"
__version__ = "2023.3.1"


# Assembly
@@ -397,7 +397,7 @@ def main(args=None):
parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
# Pipeline
parser_io = parser.add_argument_group('Required I/O arguments')
parser_io.add_argument("-m","--mags", type=str, help = "Tab-seperated value table of [id_mag]<tab>[path/to/genome.fasta]")
parser_io.add_argument("-m","--mags", type=str, help = "Either 1) Tab-seperated value table of [id_mag]<tab>[path/to/genome.fasta]; or 2) a file with a list filepaths for genomes (Must have same extension)")
parser_io.add_argument("-g","--gene_models", type=str, help = "Tab-seperated value table of [id_mag]<tab>[path/to/gene_models.gff]. Must be gff3.")
parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/cluster", help = "path/to/project_directory [Default: veba_output/cluster]")
parser_io.add_argument("--mags_extension", type=str, default="fa", help = "Fasta file extension for --mags if a list is provided [Default: fa]")
40 changes: 9 additions & 31 deletions src/phylogeny.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.2.1"
__version__ = "2023.3.1"

# Assembly
def preprocess( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -105,7 +105,7 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
cmd = [
"""
# MUSCLE
cut -f1 %s | %s -j %d 'echo "[MUSCLE] {}" && zcat "%s/{}.faa.gz" | %s -out %s/{}.msa %s'
cut -f1 %s | %s -j %d 'echo "[MUSCLE] {}" && %s -%s %s/{}.faa -output %s/{}.msa %s -threads 1'
# ClipKIT
cut -f1 %s | %s -j %d 'echo "[ClipKIT] {}" && %s %s/{}.msa -m %s -o %s/{}.msa.clipkit %s'
@@ -117,8 +117,9 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
os.path.join(directories[("intermediate", "1__hmmsearch")], "markers.tsv"),
os.environ["parallel"],
opts.n_jobs,
directories[("intermediate", "1__hmmsearch")],
os.environ["muscle"],
opts.alignment_algorithm,
directories[("intermediate", "1__hmmsearch")],
output_directory,
opts.muscle_options,

@@ -142,33 +143,7 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
os.path.join(output_directory, "alignment_table.boolean.tsv.gz"),
)
]
# """

# for ID_MARKER in $(cut -f1 %s);
# do echo $ID_MARKER;
# FAA=%s/${ID_MARKER}.faa.gz

# # Run MUSCLE
# zcat ${FAA} | %s -out %s/${ID_MARKER}.msa %s


# # Run ClipKIT
# %s %s/${ID_MARKER}.msa -m %s -o %s/${ID_MARKER}.msa.clipkit %s

# done
# """%(
# os.path.join(directories[("intermediate", "1__hmmsearch")], "markers.tsv"),
# directories[("intermediate", "1__hmmsearch")],
# os.environ["muscle"],
# output_directory,
# opts.muscle_options,
# os.environ["clipkit"],
# output_directory,
# opts.clipkit_mode,
# output_directory,
# opts.clipkit_options,
# ),
# ]

return cmd

# FastTree
@@ -542,6 +517,8 @@ def configure_parameters(opts, directories):
if opts.hmmsearch_threshold.lower() == "e":
opts.hmmsearch_threshold = None
assert_acceptable_arguments(opts.hmm_marker_field, {"accession", "name"})
assert_acceptable_arguments(opts.alignment_algorithm, {"align", "super5"})

# Set environment variables
add_executables_to_environment(opts=opts)

@@ -560,7 +537,7 @@ def main(args=None):
# Pipeline
parser_io = parser.add_argument_group('Required I/O arguments')
parser_io.add_argument("-d", "--database_hmm", type=str, help=f"path/to/HMM database of markers")
parser_io.add_argument("-a","--proteins", type=str, help = "Can be the following format: 1) Tab-seperated value table of [id_mag]<tab>[path/to/protein.fasta]; 2) List of filepaths [path/to/protein.fasta]; or 3) Directory of protein fasta using --protein_extension")
parser_io.add_argument("-a","--proteins", type=str, help = "Can be the following format: 1) Tab-seperated value table of [id_mag]<tab>[path/to/protein.fasta]; 2) Files with list of filepaths [path/to/protein.fasta]; or 3) Directory of protein fasta using --extension")
parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/phylogeny", help = "path/to/project_directory [Default: veba_output/phylogeny]")
parser_io.add_argument("-x", "--extension", type=str, default="faa", help = "Fasta file extension for proteins if a list is provided [Default: faa]")

@@ -582,6 +559,7 @@

# Muscle
parser_alignment = parser.add_argument_group('Alignment arguments')
parser_alignment.add_argument("-A", "--alignment_algorithm", type=str, default="align", help = "Muscle alignment algorithm. Align large input using Super5 algorithm if -align is too expensive. {align,super5} [Default: align]")
parser_alignment.add_argument("-g", "--minimum_genomes_aligned_ratio", type=float, default=0.95, help = "Minimum ratio of genomes include in alignment. This removes markers that are under represented. [Default: 0.95]")
parser_alignment.add_argument("-m", "--minimum_markers_aligned_ratio", type=float, default=0.2, help = "Minimum ratio of markers aligned. This removes genomes with few markers. Note, this is based on detected markers and NOT total markers in original HMM. [Default: 0.2]")
parser_alignment.add_argument("--muscle_options", type=str, default="", help="MUSCLE | More options (e.g. --arg 1 ) [Default: '']")
