Commit: v1.1.0

jolespin committed Mar 2, 2023
1 parent d37034d commit 6a420d7

Showing 7 changed files with 58 additions and 61 deletions.
27 changes: 15 additions & 12 deletions DEVELOPMENT.md
@@ -5,7 +5,7 @@
________________________________________________________________

#### Current Releases:
##### Release v1.1.0 (Currently testing before official release)
##### Release v1.1.0

* **Modules**:
* `annotate.py`
@@ -21,12 +21,14 @@ ________________________________________________________________
* Uses `binning_wrapper.py` for all binning. This makes it easier to add new binning algorithms in the future (e.g., `VAMB`). Also, check out the new multi-split binning functionality described below.
* Added `--skip_concoct` in addition to the existing `--skip_maxbin2` option, as `MaxBin2` takes very long when there are many contigs and `CONCOCT` takes a long time when there are many samples (i.e., BAM files). `MetaBAT2` is not optional.
* `binning-viral.py`
* Complete rewrite of this module which now uses `geNomad` as the default binning algorithm but still supports `VirFinder`.
* If `VirFinder` is used, `genomad annotate` is run via the `genomad_taxonomy_wrapper.py` script included in the update.
* Updated `Prodigal` → `Prodigal-GV` to handle additional viral genetic codes.
* `biosynthetic.py`
* Introduces `component_id` and `bgc_id` which are unique, parseable, and informative. For example, `component_id = SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+)` contains the unique `bgc_id` (i.e., `SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001`), shows that it is the 1st gene in the cluster (the `_1` in `region001_1`), and gives the gene start/end/strand. The `bgc_id` is composed of the `genome_id|contig_id|region_id` (a parsing sketch follows this module list).
* `classify-prokaryotic.py`
* Updated `GTDB-Tk v2.1.1` → `GTDB-Tk v2.2.3`. For now, `--skip_ani_screen` is the only option because of [this thread](https://forum.gtdb.ecogenomic.org/t/how-can-i-use-the-mash-db-option-of-classify-wf/429/4). However, `--mash_db` may be an option in the near future.
* Added functionality to classify prokaryotic genomes that were not binned via `VEBA`, available with the `--genomes` option (`--prokaryotic_binning_directory` is still available and can leverage existing intermediate files).
@@ -39,14 +41,18 @@ ________________________________________________________________
* Added functionality to classify viral genomes that were not binned via `VEBA`, available with the `--genomes` option (`--viral_binning_directory` is still available and can leverage existing intermediate files).

* `cluster.py`
* Complete rewrite of this module which now uses `MMSEQS2` as the orthogroup detection algorithm instead of `OrthoFinder`. `OrthoFinder` is overkill for creating protein clusters and it generates thousands of intermediate files (e.g., fasta, alignments, trees, etc.) which substantially increase the compute time. `MMSEQS2` has very similar performance with a fraction of the resources and compute time. Clustered the entire [*Plastisphere* dataset](https://figshare.com/articles/dataset/Genome_assemblies_gene_models_gene_annotations_taxonomy_classifications_clusters_and_counts_tables/20263974) on a local machine in ~30 minutes compared to several days on an HPC.
* Now that the resources are minimal, clustering is performed at the global level as before (i.e., all samples in the dataset) and now also at the local level (optional but ON by default), which clusters all genomes within a sample. Accompanying wrapper scripts are `global_clustering.py` and `local_clustering.py`.
* The genomic and functional feature compression ratios (FCR) (described [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04973-8)) are now calculated automatically. The calculation is `1 - number_of_clusters/number_of_features`, which can easily be converted into an unsupervised biodiversity metric (see the FCR sketch after this module list). This is calculated at the global (original implementation) and local levels.
* Input is now a table with the following columns: `[organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]`, generated easily with the `compile_genomes_table.py` script (see the loading sketch after this module list). This allows clustering to be performed for prokaryotes, eukaryotes, and viruses all at the same time.
* SLC-specific orthogroups (SSO) are now referred to as SLC-specific protein clusters (SSPC).
* Supports zfilling (e.g., `zfill=3`, `SLC7 → SLC007`) for genomic and protein cluster identifiers.
* Deprecated `fastani_to_clusters.py` in favor of the more generalizable `edgelist_to_clusters.py`, which is used for both genomic and protein clusters. This also outputs a `NetworkX` graph and a pickled dictionary `{"cluster_a":{"component_1", "component_2", ..., "component_n"}}` (see the clustering sketch after this module list).
* `phylogeny.py`
* Updated `MUSCLE` to `v5`, which has the `-align` and `-super5` algorithms, now accessible with `--alignment_algorithm`. `MUSCLE v5` cannot read from `stdin`, so the fasta files are no longer gzipped. `merge_msa.py` now outputs uncompressed fasta by default and can output gzipped fasta with the `--gzip` flag.
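
Since the `component_id`/`bgc_id` scheme in `biosynthetic.py` is parseable by design, here is a minimal sketch of how the fields can be unpacked, using the example identifier above; this is an illustration only, not a helper shipped with `VEBA`.

```python
# Example component_id from the biosynthetic.py notes above
component_id = "SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+)"

# Layout: genome_id|contig_id|<region_id>_<gene_index>|<start>-<end>(<strand>)
genome_id, contig_id, region_gene, position = component_id.split("|")
region_id, gene_index = region_gene.rsplit("_", 1)    # "region001", "1" (1st gene in the cluster)
bgc_id = "|".join([genome_id, contig_id, region_id])  # bgc_id = genome_id|contig_id|region_id
strand = position[-2]                                 # "+"
start, end = map(int, position[:-3].split("-"))       # 2, 2681
```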
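
To make the FCR formula concrete, a minimal sketch (the counts below are hypothetical):

```python
def feature_compression_ratio(number_of_clusters: int, number_of_features: int) -> float:
    """FCR = 1 - number_of_clusters/number_of_features.

    Higher values mean more redundancy was collapsed into clusters, which is
    why the ratio can be read as an unsupervised biodiversity metric at the
    global (all samples) or local (within-sample) level.
    """
    return 1 - (number_of_clusters / number_of_features)

# Hypothetical example: 650 protein clusters from 1,000 proteins
print(feature_compression_ratio(650, 1000))  # ~0.35
```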
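
A hedged sketch of loading the clustering input table with `pandas`; the filename is hypothetical, and the absence of a header row is an assumption:

```python
import pandas as pd

# Columns as described above:
# [organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]
df_genomes = pd.read_csv(
    "genomes_table.tsv",  # hypothetical path, e.g., output of compile_genomes_table.py
    sep="\t",
    header=None,
    names=["organism_type", "id_sample", "id_mag", "genome", "proteins"],
)
```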
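
Finally, a sketch of the idea behind `edgelist_to_clusters.py`: connected components of an edge list become clusters, serialized as a `NetworkX` graph plus a pickled dictionary. The edges, cluster labels, and output paths here are hypothetical, not the script's actual interface.

```python
import pickle
import networkx as nx

# Hypothetical edge list, e.g., genome pairs passing an ANI threshold
edges = [("genome_1", "genome_2"), ("genome_2", "genome_3"), ("genome_4", "genome_5")]
graph = nx.Graph(edges)

# One cluster per connected component; zfill=3 gives SLC001, SLC002, ...
clusters = {
    "SLC{}".format(str(i).zfill(3)): set(component)
    for i, component in enumerate(nx.connected_components(graph), start=1)
}
# → {"SLC001": {"genome_1", "genome_2", "genome_3"}, "SLC002": {"genome_4", "genome_5"}}

with open("clusters.dict.pkl", "wb") as f:  # hypothetical output path
    pickle.dump(clusters, f)
```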

* **`VEBA Database`**:
* `VDB_v3.1` → `VDB_v4`
* Updated `CheckV DB v1.0` → `CheckV DB v1.5`
@@ -57,7 +63,7 @@ ________________________________________________________________
* Added `reference.eukaryota_odb10.list` and corresponding `MMSEQS2` database (i.e., `microeukaryotic.eukaryota_odb10`)
* Added `NCBIfam-AMRFinder` marker set for annotation
* Added `AntiFam` marker set for contamination
* Marker set HMMs are now all gzipped (previously they could not be gzipped because of the `CheckM` CPR workflow)

* **Scripts:**
* Added:
@@ -89,6 +95,7 @@ ________________________________________________________________
* `subset_table.py` - Added option to set index column and to drop duplicates.
* `virfinder_wrapper.r` - Used to be `VirFinder_wrapper.R`. This now has an option to use FDR values instead of P values.
* `merge_annotations_and_score_taxonomy.py` - Completely rewritten. Uses `taxopy` instead of `ete3`.
* `merge_msa.py` - Output uncompressed protein fasta files by default and can compress with `--gzip` flag.

* Deprecated:
* `adjust_genomes_for_cpr.py`
@@ -162,7 +169,7 @@ ___
________________________________________________________________


#### Future Releases:
#### Path to `v2.0.0`:

**Definitely:**

@@ -172,28 +179,24 @@ ________________________________________________________________
* Automated consensus protein cluster annotations. First need to create a hierarchical naming scheme that uses NR > KOFAM > Pfam.
* Create a wrapper around `hmmsearch` that takes in score cutoffs and outputs a useable table. This will be used in place of `KOFAMSCAN` which creates thousands of intermediate files.
* Add MAG-level counts to prokaryotic and eukaryotic binning. Add an optional BAM file for viral binning; if provided, add MAG-level counts as well.
* Support genome table input for `biosynthetic.py`, `phylogeny.py`, `index.py`, etc.
* Install each module via `bioconda`

**Probably (Yes)?:**
* Add a `metabolic.py` module
* Swap [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) for [`TransSuite`](https://github.com/anonconda/TranSuite)
* Add support for `Anvi'o` object export in `cluster.py`
* Add spatial coverage to the `coverage.py` script like in the `mapping.py` module? Maybe just the `samtools coverage` output.


**...Maybe (Not)?**

* Add `VAMB` as an option for `binning-prokaryotic.py` (requires python >= 3.7,<3.8)
* Add an option for sample name prefix in `assembly.py`

________________________________________________________________

#### Change Log:
* [2023.2.23] - The largest update to date. Please refer to v1.1 for details on what has been changed.
* [2023.01.20] - Changed `-a --ani` to `-t --threshold` in `fastani_to_clusters.py` to match the usage in `edgelist_to_clusters.py` which is a generalization of `fastani_to_clusters.py` developed for `MMSEQS2` and `Diamond` implementations.
9 changes: 7 additions & 2 deletions install/README.md
@@ -93,15 +93,20 @@ cd veba/install

**2. Install VEBA environments**

**Recommended resource allocation:** 4 hours with 15 GB memory (include extra time for variable I/O speed on various hosts)

For `v1.1.0`, **this should take ~1.75 hours (~108 minutes) with ~15 GB memory allocated**. The update from `CheckM1` -> `CheckM2` and the installation of `antiSMASH` require more memory and may require grid access if the head node is limited.

```
bash install_veba.sh
```

**3. Activate the database conda environment, download, and configure databases**

**Recommended resource allocation:** 16 hours with 64 GB memory (include extra time for variable I/O speed on various hosts)

⚠️ **This step should take ~11 hrs with 64 GB memory** and should be run using a compute grid via SLURM or SunGridEngine. If this command is run on the head node, it will likely fail or time out if the connection is interrupted. The most computationally intensive steps are creating a `Diamond` database of NCBI's non-redundant reference and a `MMSEQS2` database of the microeukaryotic protein database. Note that the duration will depend on several factors, including your internet connection speed and the I/O of public repositories.

If issues arise, please [submit a GitHub issue](https://github.com/jolespin/veba/issues) prefixed with `[Database]`. We are here to help :)

25 changes: 15 additions & 10 deletions install/update_environment_variables.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# __VERSION__ = "2022.12.27"
# __VERSION__ = "2023.3.1"

# Create database
DATABASE_DIRECTORY=${1:-"."}
@@ -10,7 +10,6 @@ CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
echo ". .. ... ..... ........ ............."
echo "i * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=${REALPATH_DATABASE_DIRECTORY}"


# VEBA
for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do
echo $ENV_PREFIX;
@@ -20,23 +19,29 @@ for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do
echo "unset VEBA_DATABASE" > ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

#GTDB-Tk/CheckM
#CheckM2
echo ". .. ... ..... ........ ............."
echo "xiii * Adding the following environment variable to VEBA environments: export CHECKM2DB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM2/uniref100.KO.1.dmnd"
for ENV_NAME in VEBA-binning-prokaryotic_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
# CheckM2
echo "export CHECKM2DB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM2/uniref100.KO.1.dmnd" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset CHECKM2DB" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

#GTDB-Tk
echo ". .. ... ..... ........ ............."
echo "ii * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/"
for ENV_NAME in VEBA-binning-prokaryotic_env VEBA-classify_env; do
echo "xiv * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/"
for ENV_NAME in VEBA-classify_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
# GTDB-Tk
# GTDBTK_DATABASE_VERSION=$(ls ${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk)
echo "export GTDBTK_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/GTDBTk/" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset GTDBTK_DATA_PATH" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
# CheckM
echo "export CHECKM_DATA_PATH=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckM/" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
echo "unset CHECKM_DATA_PATH" >> ${ENV_PREFIX}/etc/conda/deactivate.d/veba.sh
done

# CheckV
echo ". .. ... ..... ........ ............."
echo "iii * Adding the following environment variable to VEBA environments: export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV/"
echo "xv * Adding the following environment variable to VEBA environments: export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV/"
for ENV_NAME in VEBA-binning-viral_env; do
ENV_PREFIX=${CONDA_BASE}/envs/${ENV_NAME}
echo "export CHECKVDB=${REALPATH_DATABASE_DIRECTORY}/Classify/CheckV" >> ${ENV_PREFIX}/etc/conda/activate.d/veba.sh
4 changes: 2 additions & 2 deletions src/biosynthetic.py
@@ -12,7 +12,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.2.1"
__version__ = "2023.3.1"


# Assembly
@@ -397,7 +397,7 @@ def main(args=None):
parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
# Pipeline
parser_io = parser.add_argument_group('Required I/O arguments')
parser_io.add_argument("-m","--mags", type=str, help = "Tab-seperated value table of [id_mag]<tab>[path/to/genome.fasta]")
parser_io.add_argument("-m","--mags", type=str, help = "Either 1) Tab-seperated value table of [id_mag]<tab>[path/to/genome.fasta]; or 2) a file with a list filepaths for genomes (Must have same extension)")
parser_io.add_argument("-g","--gene_models", type=str, help = "Tab-seperated value table of [id_mag]<tab>[path/to/gene_models.gff]. Must be gff3.")
parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/cluster", help = "path/to/project_directory [Default: veba_output/cluster]")
parser_io.add_argument("--mags_extension", type=str, default="fa", help = "Fasta file extension for --mags if a list is provided [Default: fa]")
40 changes: 9 additions & 31 deletions src/phylogeny.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.2.1"
__version__ = "2023.3.1"

# Assembly
def preprocess( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -105,7 +105,7 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
cmd = [
"""
# MUSCLE
cut -f1 %s | %s -j %d 'echo "[MUSCLE] {}" && zcat "%s/{}.faa.gz" | %s -out %s/{}.msa %s'
cut -f1 %s | %s -j %d 'echo "[MUSCLE] {}" && %s -%s %s/{}.faa -output %s/{}.msa %s -threads 1'
# ClipKIT
cut -f1 %s | %s -j %d 'echo "[ClipKIT] {}" && %s %s/{}.msa -m %s -o %s/{}.msa.clipkit %s'
@@ -117,8 +117,9 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
os.path.join(directories[("intermediate", "1__hmmsearch")], "markers.tsv"),
os.environ["parallel"],
opts.n_jobs,
directories[("intermediate", "1__hmmsearch")],
os.environ["muscle"],
opts.alignment_algorithm,
directories[("intermediate", "1__hmmsearch")],
output_directory,
opts.muscle_options,

@@ -142,33 +143,7 @@ def get_msa_cmd( input_filepaths, output_filepaths, output_directory, directorie
os.path.join(output_directory, "alignment_table.boolean.tsv.gz"),
)
]
# """

# for ID_MARKER in $(cut -f1 %s);
# do echo $ID_MARKER;
# FAA=%s/${ID_MARKER}.faa.gz

# # Run MUSCLE
# zcat ${FAA} | %s -out %s/${ID_MARKER}.msa %s


# # Run ClipKIT
# %s %s/${ID_MARKER}.msa -m %s -o %s/${ID_MARKER}.msa.clipkit %s

# done
# """%(
# os.path.join(directories[("intermediate", "1__hmmsearch")], "markers.tsv"),
# directories[("intermediate", "1__hmmsearch")],
# os.environ["muscle"],
# output_directory,
# opts.muscle_options,
# os.environ["clipkit"],
# output_directory,
# opts.clipkit_mode,
# output_directory,
# opts.clipkit_options,
# ),
# ]

return cmd

# FastTree
@@ -542,6 +517,8 @@ def configure_parameters(opts, directories):
if opts.hmmsearch_threshold.lower() == "e":
opts.hmmsearch_threshold = None
assert_acceptable_arguments(opts.hmm_marker_field, {"accession", "name"})
assert_acceptable_arguments(opts.alignment_algorithm, {"align", "super5"})

# Set environment variables
add_executables_to_environment(opts=opts)

@@ -560,7 +537,7 @@ def main(args=None):
# Pipeline
parser_io = parser.add_argument_group('Required I/O arguments')
parser_io.add_argument("-d", "--database_hmm", type=str, help=f"path/to/HMM database of markers")
parser_io.add_argument("-a","--proteins", type=str, help = "Can be the following format: 1) Tab-seperated value table of [id_mag]<tab>[path/to/protein.fasta]; 2) List of filepaths [path/to/protein.fasta]; or 3) Directory of protein fasta using --protein_extension")
parser_io.add_argument("-a","--proteins", type=str, help = "Can be the following format: 1) Tab-seperated value table of [id_mag]<tab>[path/to/protein.fasta]; 2) Files with list of filepaths [path/to/protein.fasta]; or 3) Directory of protein fasta using --extension")
parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/phylogeny", help = "path/to/project_directory [Default: veba_output/phylogeny]")
parser_io.add_argument("-x", "--extension", type=str, default="faa", help = "Fasta file extension for proteins if a list is provided [Default: faa]")

@@ -582,6 +559,7 @@

# Muscle
parser_alignment = parser.add_argument_group('Alignment arguments')
parser_alignment.add_argument("-A", "--alignment_algorithm", type=str, default="align", help = "Muscle alignment algorithm. Align large input using Super5 algorithm if -align is too expensive. {align,super5} [Default: align]")
parser_alignment.add_argument("-g", "--minimum_genomes_aligned_ratio", type=float, default=0.95, help = "Minimum ratio of genomes include in alignment. This removes markers that are under represented. [Default: 0.95]")
parser_alignment.add_argument("-m", "--minimum_markers_aligned_ratio", type=float, default=0.2, help = "Minimum ratio of markers aligned. This removes genomes with few markers. Note, this is based on detected markers and NOT total markers in original HMM. [Default: 0.2]")
parser_alignment.add_argument("--muscle_options", type=str, default="", help="MUSCLE | More options (e.g. --arg 1 ) [Default: '']")
