Problem in bin refinement step #14

Open
shankhanath opened this issue Jan 9, 2025 · 9 comments

Comments

@shankhanath

I am working with metagenomic data on a 120-core server.

I have run into a problem at the bin refinement step; please see the error below.
The log file says:
"Skipping nanophase-out/02-LongBins/INITIAL_BINNING/metabat2/metabat2-bins//bin.22.fa because the bin size is not between 50kb and 20Mb"
or
"Skipping nanophase-out/02-LongBins/INITIAL_BINNING/maxbin2/maxbin2-bins//bin.001.fasta because the bin size is not between 50kb and 20Mb"
and then
"there are 0 bins in binsB"
"Please provide valid input. Exiting..."

(nanophase) [wsjuly24@ndwor06 DFU34_output]$ nanophase meta -l combined.fastq -t 80 -o nanophase-out
[2025-01-09 17:05:15] INFO: nanophase (meta) starts
[2025-01-09 17:05:15] INFO: Command line: /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/nanophase meta -l combined.fastq -t 80 -o nanophase-out
[2025-01-09 17:05:15] INFO: long_read_only model was selected, only Nanopore long reads will be used
[2025-01-09 17:05:15] CHECK: Nanopore long-read (fastq) file has been found
[2025-01-09 17:05:15] CHECK: Check software availability and locations
[2025-01-09 17:05:16] INFO: The following packages have been found
#package             location
nanophase            /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/nanophase
flye                 /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/flye
metabat2             /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/metabat2
maxbin2              /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/run_MaxBin.pl
SemiBin              /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/SemiBin
metawrap             /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/metawrap
checkm               /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/checkm
racon                /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/racon
medaka               /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/medaka
polypolish           /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/polypolish
POLCA                /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/polca.sh
bwa                  /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/bwa
seqtk                /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/seqtk
minimap2             /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/minimap2
BBMap                /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/BBMap
parallel             /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/parallel
perl                 /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/perl
samtools             /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/samtools
gtdbtk               /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/gtdbtk
fastANI              /data/sata_data/home/wsjuly24/miniconda3/envs/nanophase/bin/fastANI
All required packages have been found in the environment. If the above certain packages integrated into nanophase were used in your investigation, please give them credit as well :)
[2025-01-09 17:05:16] TASK: Long-read assembly starts (be patient)
[2025-01-09 17:56:07] DONE: long-read assembly finished successfully: detailed log file is nanophase-out/01-LongAssemblies/flye.log
[2025-01-09 17:56:08] TASK: Initial binning::metabat2 binning starts
[2025-01-09 17:58:54] DONE: Initial binning::metabat2 binning finished successfully
MetaBAT 2 (v2.12.1) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
41 bins (121883604 bases in total) formed.
[2025-01-09 17:58:54] TASK: Initial binning::maxbin2 binning starts
[2025-01-09 18:00:16] DONE: Initial binning::maxbin2 binning finished successfully
Yielded 2 bins for contig (scaffold) file nanophase-out/01-LongAssemblies/assembly.fasta
Here are the output files for this run.
Please refer to the README file for further details.
Summary file: nanophase-out/02-LongBins/INITIAL_BINNING/maxbin2/bin.summary
Marker counts: nanophase-out/02-LongBins/INITIAL_BINNING/maxbin2/bin.marker
Marker genes for each bin: nanophase-out/02-LongBins/INITIAL_BINNING/maxbin2/bin.marker_of_each_gene.tar.gz
[2025-01-09 18:00:16] TASK: Initial binning::SemiBin binning starts
[2025-01-09 18:06:54] DONE: Initial binning::SemiBin binning finished successfully
SemiBin recovered  4 bins
If you find SemiBin useful, please cite:
Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y.
[2025-01-09 18:06:59] TASK: bin refinement starts
[2025-01-09 18:07:00] ERROR: Something wrong with bin refinement, please also check nanophase-out/02-LongBins/BIN_REFINEMENT/bin_refinement.log, terminating...
(nanophase) [wsjuly24@ndwor06 DFU34_output]$

I have also attached the bin_refinement.log file here:
bin_refinement.log

Please help

@Hydro3639
Owner

Hello,

It seems that no candidate bins were successfully generated using MaxBin2, as they were either too small (<50 Kb) or too large (>20 Mb). This is why the bin refinement step exited. If you had sufficient sequencing coverage, my best guess is that the metagenome has very low microbial diversity.

Here is my suggestion: you can try bin refinement using MetaBAT2 and SemiBin with the following command:

metawrap bin_refinement -o nanophase-out/02-LongBins/BIN_REFINEMENT -c 50 -x 10 -t 10 -A nanophase-out/02-LongBins/INITIAL_BINNING/metabat2/metabat2-bins/ -B nanophase-out/02-LongBins/INITIAL_BINNING/semibin/semibin-bins > nanophase-out/02-LongBins/BIN_REFINEMENT/bin_refinement.log

If it works, you can ignore the previous error and re-run the same nanophase command as before. If it does not work, try checking the input long reads (accuracy, length, etc.).
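To see which initial bins fall outside the size window mentioned in the log, the total length of each bin can be tallied directly. The following is a minimal sketch, not part of nanophase or MetaWRAP: `bin_sizes` is a hypothetical helper, and it assumes that "50kb" and "20Mb" mean 50,000 and 20,000,000 bases with both limits inclusive.

```shell
#!/usr/bin/env bash
# bin_sizes: hypothetical helper (not part of nanophase or MetaWRAP).
# Prints "<file>\t<total bases>\t<ok|skipped>" for every .fa/.fasta in a folder.
# Assumption: the window in bin_refinement.log is 50,000 to 20,000,000 bases.
bin_sizes() {
    local dir="$1" min=50000 max=20000000 f size status
    for f in "$dir"/*.fa "$dir"/*.fasta; do
        [ -e "$f" ] || continue   # skip glob patterns that matched nothing
        # Sum the length of every non-header line in the FASTA file.
        size=$(awk '!/^>/ { n += length($0) } END { printf "%d\n", n }' "$f")
        if [ "$size" -ge "$min" ] && [ "$size" -le "$max" ]; then
            status="ok"
        else
            status="skipped"
        fi
        printf '%s\t%s\t%s\n' "$(basename "$f")" "$size" "$status"
    done
}
```

For example, `bin_sizes nanophase-out/02-LongBins/INITIAL_BINNING/maxbin2/maxbin2-bins` lists the MaxBin2 bins with their sizes, which shows whether they were too small or too large.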

Best

@shankhanath
Author

Hi,
Thank you for this. Everything ran successfully until the final stage, where the following error appeared. Can you help me with this?

[2025-01-11 12:58:23] DONE: genome quality assessment finished successfully. Now, go to the next stage: genome classification
[2025-01-11 12:58:30] TASK: Genome taxa classification starts
[2025-01-11 12:58:33] INFO: GTDB-Tk v2.3.2
[2025-01-11 12:58:33] INFO: gtdbtk classify_wf --genome_dir nanophase-out/03-Polishing/Final-bins/ -x fasta --out_dir nanophase-out/03-Polishing/Final-bins/tmp --cpus 80 --skip_ani_screen
[2025-01-11 12:58:33] ERROR: Controlled exit resulting from early termination.
[2025-01-11 12:58:33] DONE: genome classification done
cat: 'nanophase-out/03-Polishing/Final-bins/tmp/classify/gtdbtk.*summary.tsv': No such file or directory
[2025-01-11 12:58:33] ERROR: Something wrong with GTDB::Taxa process, terminating...

Best

@Hydro3639
Owner

It seems that something went wrong during taxonomy inference with GTDB-Tk. Could you please provide the log file for the GTDB-Tk step? It should be located in the folder nanophase-out/03-Polishing/Final-bins/tmp.
Alternatively, you can simply re-run the following command and see what is printed to your screen:

gtdbtk classify_wf --genome_dir nanophase-out/03-Polishing/Final-bins/ -x fasta --out_dir nanophase-out/03-Polishing/Final-bins/tmp --cpus 80 --skip_ani_screen

@shankhanath
Author

Hi,

Thank you for the command.
I ran it, and it reports that the GTDB-Tk reference data does not exist or is corrupted:

[2025-01-13 16:01:03] INFO: gtdbtk classify_wf --genome_dir nanophase-out/03-Polishing/Final-bins/ -x fasta --out_dir nanophase-out/03-Polishing/Final-bins/tmp --cpus 80 --skip_ani_screen

================================================================================
                                     ERROR
________________________________________________________________________________

           The GTDB-Tk reference data does not exist or is corrupted.
                   GTDBTK_DATA_PATH=/path/to/release/package/

   Please compare the checksum to those provided in the download repository.
          https://github.com/Ecogenomics/GTDBTk#gtdb-tk-reference-data
================================================================================
[2025-01-13 16:01:03] ERROR: Controlled exit resulting from early termination.

@Hydro3639
Owner

Now I understand what happened. It seems that the GTDB database was not set up for GTDB-Tk. You can refer to this section for instructions on how to set it up. Let me know if it doesn't work after the database setup:)
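Before re-running the pipeline, it may help to confirm that the variable named in the error message, GTDBTK_DATA_PATH, is set and points to an existing directory. The following is a minimal sketch; `check_gtdbtk_db` is a hypothetical helper, not part of nanophase or GTDB-Tk, and it only checks that the directory exists, not that its contents are complete.

```shell
#!/usr/bin/env bash
# check_gtdbtk_db: hypothetical helper (not part of nanophase or GTDB-Tk).
# Verifies that GTDBTK_DATA_PATH is set and refers to an existing directory.
check_gtdbtk_db() {
    if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
        echo "GTDBTK_DATA_PATH is not set"
        return 1
    fi
    if [ ! -d "$GTDBTK_DATA_PATH" ]; then
        echo "GTDBTK_DATA_PATH is not a directory: $GTDBTK_DATA_PATH"
        return 1
    fi
    echo "GTDBTK_DATA_PATH looks usable: $GTDBTK_DATA_PATH"
}
```

Run `check_gtdbtk_db` inside the activated nanophase environment. If it passes, GTDB-Tk's own `gtdbtk check_install` subcommand can then be used for a more thorough check of the reference data.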

@shankhanath
Author

Hi,

I downloaded both databases and set everything up according to the section you referred to, then ran the command again. It went well initially, but at the end an error came from FastANI. I looked into the folder where the GTDB-Tk data was extracted, i.e. "release220", and strangely there is no folder called fastani. What could have gone wrong? I downloaded and extracted the databases properly, although I had to use a different URL for the download because the original one has changed:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz && tar xvzf gtdbtk_data.tar.gz

The error is given below.

[2025-01-14 01:28:50] INFO: gtdbtk classify_wf --genome_dir nanophase-out/03-Polishing/Final-bins/ -x fasta --out_dir nanophase-out/03-Polishing/Final-bins/tmp --cpus 80 --skip_ani_screen
[2025-01-14 01:28:50] INFO: Using GTDB-Tk reference data version r220: /data/sata_data/home/wsjuly24/release220
[2025-01-14 01:28:50] INFO: Identifying markers in 1 genomes with 80 threads.
[2025-01-14 01:28:50] TASK: Running Prodigal V2.6.3 to identify genes.
[2025-01-14 01:28:53] INFO: Completed 1 genome in 2.57 seconds (2.57 seconds/genome).
[2025-01-14 01:28:54] TASK: Identifying TIGRFAM protein families.
[2025-01-14 01:29:01] INFO: Completed 1 genome in 7.18 seconds (7.18 seconds/genome).
[2025-01-14 01:29:01] TASK: Identifying Pfam protein families.
[2025-01-14 01:29:02] INFO: Completed 1 genome in 0.65 seconds (1.55 genomes/second).
[2025-01-14 01:29:02] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).
[2025-01-14 01:29:02] TASK: Summarising identified marker genes.
[2025-01-14 01:29:02] INFO: Completed 1 genome in 0.01 seconds (75.78 genomes/second).
[2025-01-14 01:29:02] INFO: Done.
[2025-01-14 01:29:02] INFO: Aligning markers in 1 genomes with 80 CPUs.
[2025-01-14 01:29:02] INFO: Processing 1 genomes identified as bacterial.
[2025-01-14 01:29:09] INFO: Read concatenated alignment for 107,235 GTDB genomes.
[2025-01-14 01:29:09] TASK: Generating concatenated alignment for each marker.
[2025-01-14 01:29:13] INFO: Completed 1 genome in 0.02 seconds (55.44 genomes/second).
[2025-01-14 01:29:14] TASK: Aligning 93 identified markers using hmmalign 3.3.2 (Nov 2020).
[2025-01-14 01:29:19] INFO: Completed 93 markers in 0.44 seconds (213.08 markers/second).
[2025-01-14 01:29:19] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2025-01-14 01:31:34] INFO: Completed 107,236 sequences in 2.26 minutes (47,517.51 sequences/minute).
[2025-01-14 01:31:35] INFO: Masked bacterial alignment from 41,084 to 5,035 AAs.
[2025-01-14 01:31:35] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2025-01-14 01:31:35] INFO: Creating concatenated alignment for 107,236 bacterial GTDB and user genomes.
[2025-01-14 01:32:09] INFO: Creating concatenated alignment for 1 bacterial user genomes.
[2025-01-14 01:32:09] INFO: Done.
[2025-01-14 01:32:10] WARNING: Setting pplacer CPUs to 64, as pplacer is known to hang if >64 are used. You can override this using: --pplacer_cpus
[2025-01-14 01:32:10] TASK: Placing 1 bacterial genomes into backbone reference tree with pplacer using 64 CPUs (be patient).
[2025-01-14 01:32:10] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2025-01-14 01:34:05] INFO: Calculating RED values based on reference tree.
[2025-01-14 01:34:06] INFO: 1 out of 1 have an class assignments. Those genomes will be reclassified.
[2025-01-14 01:34:06] TASK: Placing 1 bacterial genomes into class-level reference tree 7 (1/1) with pplacer using 64 CPUs (be patient).
[2025-01-14 01:35:49] INFO: Calculating RED values based on reference tree.
[2025-01-14 01:35:50] TASK: Traversing tree to determine classification method.
[2025-01-14 01:35:50] INFO: Completed 1 genome in 0.00 seconds (8,256.50 genomes/second).
[2025-01-14 01:35:50] ERROR: Reference genome missing from FastANI database: /data/sata_data/home/wsjuly24/release220/fastani/database/GCF/000/973/085/GCF_000973085.1_genomic.fna.gz
[2025-01-14 01:35:50] ERROR: Controlled exit resulting from an unrecoverable error or warning.

@Hydro3639
Owner

Hi, this may be my mistake. The last time I updated nanophase, I upgraded the GTDB database from R207 to R214. However, the latest version available today is R220, which has a different structure. Unfortunately, to use the current version of nanophase you will need to download R214 and configure the path accordingly. I apologize for the inconvenience. You can refer to the following commands:

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz && tar xvzf gtdbtk_r214_data.tar.gz
echo "export GTDBTK_DATA_PATH=/path/to/release/package/" >> $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh
conda deactivate && conda activate nanophase
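The FastANI error above refers to a path under `fastani/database/` inside the reference data folder, and the extracted release220 folder had no `fastani` directory. A quick check for that directory after extracting a package can catch the mismatch early. This is a minimal sketch; `has_fastani_dir` is a hypothetical helper, and treating the presence of `fastani/` as the signal of a compatible layout is an assumption based only on the error message in this thread.

```shell
#!/usr/bin/env bash
# has_fastani_dir: hypothetical helper (not part of nanophase or GTDB-Tk).
# Checks whether a reference data folder contains a fastani/ subdirectory.
# With no argument, it falls back to GTDBTK_DATA_PATH.
has_fastani_dir() {
    local ref_dir="${1:-${GTDBTK_DATA_PATH:-}}"
    if [ -n "$ref_dir" ] && [ -d "$ref_dir/fastani" ]; then
        echo "found: $ref_dir/fastani"
    else
        echo "missing: ${ref_dir:-<unset>}/fastani"
        return 1
    fi
}
```

After re-activating the environment, calling `has_fastani_dir` with no argument checks whichever folder GTDBTK_DATA_PATH currently points to.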

@shankhanath
Author

Yes, thanks a ton for this. I am running the commands and the database is downloading. I will update you once everything has run.

Can you help me understand one thing? Nanophase works very well for ONT fastq reads, but I also have many shotgun metagenome samples (2x250 bp). I have done the assembly/binning using IDBA-UD and MEGAHIT. Can those assembled contigs be used at any stage of the nanophase pipeline? I want to perform a similar MAG-based analysis on the shotgun metagenome data, but I have not found a single comprehensive pipeline for that, the way nanophase is for ONT fastq data.

Can you help me with this?

@Hydro3639
Owner

No problem:)

If the shotgun metagenome reads come from the same sample as the long reads, you can use nanophase with the --hybrid option. However, I believe that might not be the case here. If I understand correctly, you're looking to recover MAGs from shotgun metagenome data, in which case I recommend trying MetaWRAP.

If you already have MAGs, they can technically be used in the nanophase pipeline, but I wouldn't suggest that approach. Instead, I recommend processing them separately: you can run CheckM/CheckM2 for quality assessment, GTDB-Tk for taxonomic assignment and phylogenetic tree construction, DRAM/Bakta for quick genome annotations, CoverM/sylph for abundance estimation, and so on.
