
Adding the new Greengenes2 database for classification #658

Open
aimirza opened this issue Nov 9, 2023 · 19 comments
Labels: enhancement (New feature or request)

Comments

@aimirza

aimirza commented Nov 9, 2023

Description of feature

Greengenes2 was recently released. It is a new version of the Greengenes database, redesigned from the ground up and backed by whole genomes, with a focus on harmonizing 16S rRNA and shotgun metagenomic datasets. Its phylogenetic coverage is also much larger than that of previous resources such as SILVA, Greengenes, and GTDB. It would be great to add this database as an optional feature for classifying sequences. Usage instructions are linked below, and there is a QIIME 2 plugin. Note that the approach for classifying sequences differs between V4 and non-V4 sequences.

Paper: https://www.nature.com/articles/s41587-023-01845-1
How to use it: https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291

aimirza added the enhancement label Nov 9, 2023
@d4straub
Collaborator

Hi there,
yes, that is indeed an interesting database. However, I dislike that it is very much centered on QIIME 2 and the V4 region. GTDB also allows harmonizing 16S and shotgun metagenomics, and it is already available in ampliseq & mag.

Greengenes2 was discussed in https://nfcore.slack.com/archives/CEA7TBJGJ/p1690539708378009 & https://nfcore.slack.com/archives/CEA7TBJGJ/p1678204777328909. Using --skip_dada_taxonomy --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza might do the job (not tested!); feedback would be appreciated.
Otherwise, preprocessing the database with QIIME2 v2023.7 (the version used in ampliseq v2.7.0) and providing the resulting classifier to the pipeline with --classifier should work already.
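Something along these lines might work (untested; --input, the primers, and the output directory are just example values, and the pre-trained backbone classifier must match the QIIME2 version used by the pipeline):

# Untested sketch: skip DADA2 taxonomy and hand ampliseq a pre-trained
# Greengenes2 backbone classifier instead of training one in the pipeline.
nextflow run nf-core/ampliseq \
    -profile singularity \
    --input samplesheet.tsv \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --outdir ./results \
    --skip_dada_taxonomy \
    --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza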

I am hoping for Greengenes2 to be integrated for DADA2 classification; that would take care of all preprocessing and make the database relatively easy to add here, including an upload to Zenodo, which is much preferred over a university-hosted DB. Greengenes2 was said to be provided "soon-ish" as a DADA2 database on Zenodo, see benjjneb/dada2#1680 and benjjneb/dada2#1829.

@d4straub
Collaborator

Greengenes2 support for QIIME2 is now available in the dev branch and will be in the next release. I won't close this issue, though, because there is still no news for DADA2 (or I missed it).

@aimirza
Author

aimirza commented Aug 19, 2024 via email

@d4straub
Collaborator

d4straub commented Sep 2, 2024

Hi @aimirza,

it seems that greengenes2 is an option for --qiime_ref_taxonomy, see https://nf-co.re/ampliseq/2.11.0/parameters/#qiime_ref_taxonomy. Where would you expect "greengenes2" to appear as an option where it doesn't?

@aimirza
Author

aimirza commented Sep 2, 2024

My mistake, I was looking at --dada_ref_taxonomy.

@aimirza
Author

aimirza commented Sep 2, 2024

Paper: https://www.nature.com/articles/s41587-023-01845-1
How to use it: https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291

How are you using qiime2 to classify ASVs with the greengenes2 database? Are you following the 'How to use it' guidelines from the link you shared or are you using a pre-trained classifier?

@d4straub
Collaborator

d4straub commented Sep 3, 2024

How are you using qiime2 to classify ASVs with the greengenes2 database?

The following files are used

'greengenes2' {
    title     = "Greengenes2 16S - Version 2022.10"
    file      = [ "http://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz", "http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.md5.tsv.gz" ]
    citation  = "McDonald, D., Jiang, Y., Balaban, M. et al. Greengenes2 unifies microbial data in a single reference tree. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01845-1"
    fmtscript = "taxref_reformat_qiime_greengenes2022.sh"
}

to extract sequences with primers and train the classifier.
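Under the hood this boils down to the standard QIIME2 import, primer-based read extraction, and naive Bayes training steps. Roughly (a sketch, not the exact module code; file names are placeholders):

# Sketch of the QIIME2 steps behind QIIME2_EXTRACT and QIIME2_TRAIN
# (file names are placeholders, not the exact module variables).
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path 2022.10.seqs.fna --output-path ref-seqs.qza
# (add --input-format HeaderlessTSVTaxonomyFormat if the taxonomy TSV has no header line)
qiime tools import --type 'FeatureData[Taxonomy]' \
    --input-path 2022.10.taxonomy.md5.tsv --output-path ref-taxonomy.qza

# QIIME2_EXTRACT: cut the reference down to the amplified region
qiime feature-classifier extract-reads \
    --i-sequences ref-seqs.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --o-reads ref-reads.qza

# QIIME2_TRAIN: train the naive Bayes classifier on the extracted region
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-reads.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --o-classifier classifier.qza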

@aimirza
Author

aimirza commented Sep 4, 2024

Wow, extracting reads (QIIME2_EXTRACT) takes a long time. It ran for a day and got canceled because of the default 1-day limit; I increased the limit and am now waiting. Since it takes so long, it would be nice to have the option to use QIIME 2's simple and very quick classification method for V4 regions, which takes the set intersection between the ASVs and what exists in the database. No training or classifiers are needed. The drawback of this approach is that ASVs not found in the database won't be classified, but most ASVs should reportedly get classified.
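If I read the forum tutorial correctly, for V4 data this is roughly (a sketch only; the exact parameter names should be double-checked against the current q2-greengenes2 plugin):

# Sketch of the V4 exact-match approach from the Greengenes2 tutorial
# (file names are placeholders; verify flags against the current plugin).
qiime greengenes2 filter-features \
    --i-feature-table feature-table.qza \
    --i-reference 2022.10.taxonomy.asv.nwk.qza \
    --o-filtered-feature-table feature-table_gg2.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
    --i-table feature-table_gg2.qza \
    --o-classification taxonomy.qza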

@d4straub
Collaborator

d4straub commented Sep 4, 2024

QIIME2_EXTRACT is running 8h 58m on our hpc

Yes, I tested it and it takes long, check out #666 (comment).
It was implemented that way because it caters to every use case, not just V4. If you want to implement the super quick classification method and open a PR, that would of course be nice.

@aimirza
Author

aimirza commented Sep 6, 2024

Changing the time limit doesn't seem to work properly. I supplied new config rules via the -c parameter, such as:

process {
    withName:QIIME2_EXTRACT {
        cpus   = 2
        memory = 42.GB
        time   = 500.h
    }
}

I also tried the config below, but it still failed after 1 day:

process {

    cpus   = 2
    memory = 42.GB
    time   = 500.h

}

@d4straub
Collaborator

d4straub commented Sep 6, 2024

What about the CPUs and memory, are they altered successfully? If yes, check your --max_time setting; maybe another config is overriding it?
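One way to check what actually gets applied is to print the resolved configuration, e.g. (a sketch, assuming the run is started from the pipeline directory as in your command):

# Print the resolved configuration and look for the QIIME2_EXTRACT overrides
nextflow -c gg2.config config . -profile singularity | grep -B 2 -A 5 "QIIME2_EXTRACT"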

@aimirza
Author

aimirza commented Sep 6, 2024

I had also set --max_time to 500h.
Below is my sbatch script:

#!/bin/bash -l
#SBATCH --time=3-12:00:00
#SBATCH --nodes=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G          
#SBATCH --error=%x-%j.error
#SBATCH --output=%x-%j.out


nextflow run main.nf \
        -profile singularity \
        -c /home/amirza/projects/def-sponsor01/data_share/ampliseq/gg2.config \
        --input_fasta ./results_full2/dada2/ASV_seqs.fasta \
        --FW_primer GTGYCAGCMGCCGCGGTAA \
        --RV_primer GGACTACNVGGGTWTCTAAT \
        --metadata "Metadata_rename_with_batch_info.tsv" \
        --outdir ./test_gg2 \
        --ignore_empty_input_files \
        --ignore_failed_trimming \
        --qiime_ref_taxonomy greengenes2 \
        --skip_dada_taxonomy \
        --skip_qiime_downstream \
        --validate_params \
        --max_cpus 8 \
        --max_memory 84.GB \
        --max_time 500h \
        --skip_barrnap \
        --skip_fastqc \
        -resume

I also don't see multiple jobs running at the same time. The only related parameters I see listed in the log file are --max_cpus, --max_memory and --max_time.
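Presumably, since no executor is configured, Nextflow runs every process with the local executor inside this single sbatch allocation, which would explain why no separate jobs show up. A minimal, untested sketch of letting it submit each process as its own SLURM job instead:

# Untested sketch: append an executor setting to the custom config so that
# Nextflow submits each process as its own SLURM job.
cat >> gg2.config <<'EOF'
process.executor = 'slurm'
EOF

With that, the sbatch script would only need to host the Nextflow head job, and the per-process cpus/memory/time requests would go to the scheduler.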

The supplied config file (gg2.config) is:

process {
    withName:QIIME2_EXTRACT {
        cpus   = 8
        memory = 12.GB
        time   = 500.h
    }
}

N E X T F L O W ~ version 23.04.3
nf-core/ampliseq v2.10.0

@aimirza
Author

aimirza commented Sep 7, 2024

I think I got it to work after increasing the number of CPUs, but now I have another problem. Apparently it is running out of space ("[Errno 28] No space left on device") when running QIIME2_PREPTAX:QIIME2_TRAIN, even though I have 3 TB left on my device. Any idea what the issue is?

@d4straub
Copy link
Collaborator

d4straub commented Sep 9, 2024

Hi there, this is going way out of the scope of this issue (adding the gg2 database). Your problems are not related to gg2 but to executing a large job on your HPC. The out-of-space error is most likely related to your HPC's settings for tmp/scratch data; please contact your sysadmin.

@aimirza
Author

aimirza commented Sep 9, 2024

I need to know a couple of things about using the gg2 database. When running the QIIME2_TRAIN process on the gg2 database, which is a resource-intensive job running on 1 CPU, what is the minimum memory it requires? Second, where are the tmp files stored when running QIIME2_TRAIN? The log says "Debug info has been saved to /tmp/qiime2-q2cli-err-cegyux3s.log", but no such file exists in that directory, nor is it in the tmp directory TMPDIR I defined before running the pipeline.

@aimirza
Author

aimirza commented Sep 10, 2024

To reduce memory usage, I will add the parameter --p-classify--chunk-size 10000 (default 20000) to the qiime feature-classifier fit-classifier-naive-bayes command in the modules/local/qiime2_train.nf module. I'll let you know if it works.
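The adjusted call would then look roughly like this (a sketch only; file names are placeholders, not the module's actual variables):

# Sketch of the training call with a reduced chunk size
# (placeholder file names, not the module's actual inputs/outputs).
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-reads.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --p-classify--chunk-size 10000 \
    --o-classifier classifier.qza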

@aimirza
Author

aimirza commented Oct 3, 2024

QIIME2_TRAIN is dumping data into the /tmp directory rather than into my specified /scratch directory. The issue is likely the /tmp folder filling up, not memory (which was set to 86.GB). Each compute node has limited local space since the nodes are primarily used for computation; most of the storage is located in our separate directories under /scratch. It's strange because we've already set the tmp folder for Nextflow to /scratch/group_share/tmp/. I've also set the following tmp directories in the script before running the pipeline:

export TMPDIR="/scratch/path/to/directory/"
export TEMP="/scratch/path/to/directory/"
export TMP="/scratch/path/to/directory/"
export QIIME2_TMPDIR="/scratch/group_share/tmp/amirza/data/"
export JOBLIB_TEMP_FOLDER="/scratch/group_share/tmp/amirza/data/"

export NXF_WORK="/scratch/group_share/nextflow_workdir/amirza"
export NXF_TEMP="/scratch/group_share/tmp/amirza/data/"
export SINGULARITY_TMPDIR="/scratch/group_share/tmp/amirza/"
export NXF_SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"
export APPTAINER_TMPDIR="/scratch/group_share/tmp/amirza/"
export APPTAINERENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITYENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"

None of that worked, but I finally got it to work after weeks of trying, HURRAY!!

To address the issue, I created an additional configuration file with the following adjustments and passed it to the -c parameter:

Binding the /scratch Directory in Singularity:

singularity {
    runOptions = '--bind /scratch/group_share/tmp/:/scratch/group_share/tmp/'
}

This option explicitly binds the /scratch/group_share/tmp/ directory to the same path within the Singularity container. With this bind in place, any temporary files created by the QIIME 2 processes/plugins are directed to the larger storage area in /scratch rather than to the limited local /tmp directory on the compute nodes.

Setting Environment Variables for Temporary Directories:

process {
    withName: 'QIIME2_TRAIN' {
        scratch = true
        
        // Set environment variables explicitly
        env.TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.TEMP = '/scratch/group_share/tmp/amirza/data/'
        env.TMP = '/scratch/group_share/tmp/amirza/data/'
        env.QIIME2_TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.JOBLIB_TEMP_FOLDER = '/scratch/group_share/tmp/amirza/data/'

        env.SINGULARITY_CACHEDIR = '/scratch/group_share/singularity_imgs/'
        env.APPTAINER_CACHEDIR = '/scratch/group_share/singularity_imgs/'
    }
}

These environment variables (TMPDIR, TEMP, TMP, QIIME2_TMPDIR, and JOBLIB_TEMP_FOLDER) might be used by various tools (such as qiime2 plugins) and processes to define where temporary files are stored. By setting these explicitly to /scratch/group_share/tmp/amirza/data/, I redirected the storage of temporary files from the limited /tmp directory to a designated area with sufficient space.

Additionally, setting SINGULARITY_CACHEDIR and APPTAINER_CACHEDIR ensures that the container caching mechanisms also use the allocated /scratch space, avoiding the use of local directories that might be space-constrained.

Would you know which specific changes likely fixed the problem?

With 8 CPUs of 10 GB each, I finished classifying my ASVs in 20 hours.

@d4straub
Collaborator

d4straub commented Oct 7, 2024

Thanks for detailing the solution!
Did you figure out whether singularity runOptions was needed in addition to all the TMP and CACHEDIR settings, or just the latter?

@aimirza
Author

aimirza commented Oct 8, 2024

Actually, binding Singularity to the specified directory via singularity { runOptions ... } was essential; specifying TMP and TMPDIR was not enough. I discovered that the variables (QIIME2_TMPDIR, JOBLIB_TEMP_FOLDER, etc.) were not defined inside the container, and therefore not used, by adding a check to the QIIME2_EXTRACT script that printed whether each variable was set.
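The check was along these lines (a sketch; when pasted into a Nextflow script block, the dollar signs need escaping):

# Print whether each tmp-related variable is set inside the container.
for v in TMPDIR TEMP TMP QIIME2_TMPDIR JOBLIB_TEMP_FOLDER; do
    echo "$v=${!v:-<unset>}"
done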
