
Adding the new Greengenes2 database for classification #658

Open
aimirza opened this issue Nov 9, 2023 · 19 comments
Labels: enhancement (New feature or request)

Comments

@aimirza

aimirza commented Nov 9, 2023

Description of feature

Greengenes2 was recently released. It is a new version of the Greengenes database, redesigned from the ground up and backed by whole genomes, with a focus on harmonizing 16S rRNA and shotgun metagenomic datasets. Its phylogenetic coverage is also much larger than that of previous resources such as SILVA, Greengenes, and GTDB. It would be great to add this database as an optional feature for classifying sequences. Usage instructions are linked below, and there is a QIIME 2 plugin. Note that the approach for classifying sequences differs between V4 and non-V4 sequences.

Paper: https://www.nature.com/articles/s41587-023-01845-1
How to use it: https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291

aimirza added the enhancement label Nov 9, 2023
@d4straub
Collaborator

Hi there,
yes, that is indeed an interesting database. However, I dislike that it is very much centered on QIIME 2 and the V4 region. GTDB also allows harmonizing 16S and shotgun metagenomics, and it is already available in ampliseq & mag.

Greengenes2 was discussed in https://nfcore.slack.com/archives/CEA7TBJGJ/p1690539708378009 & https://nfcore.slack.com/archives/CEA7TBJGJ/p1678204777328909. Using --skip_dada_taxonomy --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza might do the job (not tested!); feedback would be appreciated.
Otherwise, preprocessing the database with QIIME2 v2023.7 (the version used in ampliseq v2.7.0) and providing the resulting classifier to the pipeline with --classifier should work already.
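Something along these lines might work (untested; --input, the primers, and the output directory are just example values, and the pre-trained backbone classifier must match the QIIME2 version used by the pipeline):

# Untested sketch: skip DADA2 taxonomy and hand ampliseq a pre-trained
# Greengenes2 backbone classifier instead of training one in the pipeline.
nextflow run nf-core/ampliseq \
    -profile singularity \
    --input samplesheet.tsv \
    --FW_primer GTGYCAGCMGCCGCGGTAA \
    --RV_primer GGACTACNVGGGTWTCTAAT \
    --outdir ./results \
    --skip_dada_taxonomy \
    --classifier http://ftp.microbio.me/greengenes_release/current/2022.10.backbone.full-length.nb.qza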

I am hoping for Greengenes2 to be integrated for DADA2 classification; that would take care of all preprocessing and make the database relatively easy to add here, including an upload to Zenodo, which is much preferred over a university-hosted DB. Greengenes2 was said to be provided "soon-ish" as a DADA2 database on Zenodo, see benjjneb/dada2#1680 and benjjneb/dada2#1829.

@d4straub
Collaborator

Greengenes2 support for QIIME2 is now available in the dev branch and will be in the next release. I won't close this issue, though, because there is still no news for DADA2 (or I missed it).

@aimirza
Author

aimirza commented Aug 19, 2024 via email

@d4straub
Collaborator

d4straub commented Sep 2, 2024

Hi @aimirza,

it seems that greengenes2 is an option for --qiime_ref_taxonomy, see https://nf-co.re/ampliseq/2.11.0/parameters/#qiime_ref_taxonomy. Where would you expect "greengenes2" to appear as an option where it doesn't?

@aimirza
Author

aimirza commented Sep 2, 2024

My mistake, I was looking at --dada_ref_taxonomy.

@aimirza
Author

aimirza commented Sep 2, 2024

Paper: https://www.nature.com/articles/s41587-023-01845-1
How to use it: https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291

How are you using qiime2 to classify ASVs with the greengenes2 database? Are you following the 'How to use it' guidelines from the link you shared or are you using a pre-trained classifier?

@d4straub
Collaborator

d4straub commented Sep 3, 2024

How are you using qiime2 to classify ASVs with the greengenes2 database?

The following files are used

'greengenes2' {
    title     = "Greengenes2 16S - Version 2022.10"
    file      = [ "http://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz", "http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.md5.tsv.gz" ]
    citation  = "McDonald, D., Jiang, Y., Balaban, M. et al. Greengenes2 unifies microbial data in a single reference tree. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01845-1"
    fmtscript = "taxref_reformat_qiime_greengenes2022.sh"
}

to extract sequences with primers and train the classifier.
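Under the hood this boils down to the standard QIIME2 import, primer-based read extraction, and naive Bayes training steps. Roughly (a sketch, not the exact module code; file names are placeholders):

# Sketch of the QIIME2 steps behind QIIME2_EXTRACT and QIIME2_TRAIN
# (file names are placeholders, not the exact module variables).
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path 2022.10.seqs.fna --output-path ref-seqs.qza
# (add --input-format HeaderlessTSVTaxonomyFormat if the taxonomy TSV has no header line)
qiime tools import --type 'FeatureData[Taxonomy]' \
    --input-path 2022.10.taxonomy.md5.tsv --output-path ref-taxonomy.qza

# QIIME2_EXTRACT: cut the reference down to the amplified region
qiime feature-classifier extract-reads \
    --i-sequences ref-seqs.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --o-reads ref-reads.qza

# QIIME2_TRAIN: train the naive Bayes classifier on the extracted region
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-reads.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --o-classifier classifier.qza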

@aimirza
Author

aimirza commented Sep 4, 2024

Wow, extracting reads (QIIME2_EXTRACT) takes a long time. It ran for a day and got canceled because of the default 1-day limit; I increased the limit and am now waiting. Since it takes so long, it would be nice to have the option to use QIIME 2's simple and very quick classification method for V4 regions, which takes the set intersection between the ASVs and what exists in the database. No training or classifiers are needed. The drawback of this approach is that ASVs not found in the database won't be classified, but most ASVs should reportedly get classified.
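If I read the forum tutorial correctly, for V4 data this is roughly (a sketch only; the exact parameter names should be double-checked against the current q2-greengenes2 plugin):

# Sketch of the V4 exact-match approach from the Greengenes2 tutorial
# (file names are placeholders; verify flags against the current plugin).
qiime greengenes2 filter-features \
    --i-feature-table feature-table.qza \
    --i-reference 2022.10.taxonomy.asv.nwk.qza \
    --o-filtered-feature-table feature-table_gg2.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
    --i-table feature-table_gg2.qza \
    --o-classification taxonomy.qza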

@d4straub
Collaborator

d4straub commented Sep 4, 2024

QIIME2_EXTRACT is running 8h 58m on our hpc

Yes, I tested it and it takes long, check out #666 (comment).
It was implemented that way because it caters to every use case, not just V4. If you want to implement the super quick classification method and open a PR, that would of course be nice.

@aimirza
Author

aimirza commented Sep 6, 2024

Changing the time limit doesn't seem to work properly. I supplied new config rules via the -c parameter, such as:

process {
    withName:QIIME2_EXTRACT {
        cpus   = 2
        memory = 42.GB
        time   = 500.h
    }
}

I also tried the config below, but it still failed after 1 day:

process {

    cpus   = 2
    memory = 42.GB
    time   = 500.h

}

@d4straub
Collaborator

d4straub commented Sep 6, 2024

What about the CPUs and memory, are they altered successfully? If yes, check your --max_time setting; maybe another config is overriding it?
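One way to check what actually gets applied is to print the resolved configuration, e.g. (a sketch, assuming the run is started from the pipeline directory as in your command):

# Print the resolved configuration and look for the QIIME2_EXTRACT overrides
nextflow -c gg2.config config . -profile singularity | grep -B 2 -A 5 "QIIME2_EXTRACT"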

@aimirza
Author

aimirza commented Sep 6, 2024

I had also set --max_time to 500h.
Below is my sbatch script:

#!/bin/bash -l
#SBATCH --time=3-12:00:00
#SBATCH --nodes=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G          
#SBATCH --error=%x-%j.error
#SBATCH --output=%x-%j.out


nextflow run main.nf \
        -profile singularity \
        -c /home/amirza/projects/def-sponsor01/data_share/ampliseq/gg2.config \
        --input_fasta ./results_full2/dada2/ASV_seqs.fasta \
        --FW_primer GTGYCAGCMGCCGCGGTAA \
        --RV_primer GGACTACNVGGGTWTCTAAT \
        --metadata "Metadata_rename_with_batch_info.tsv" \
        --outdir ./test_gg2 \
        --ignore_empty_input_files \
        --ignore_failed_trimming \
        --qiime_ref_taxonomy greengenes2 \
        --skip_dada_taxonomy \
        --skip_qiime_downstream \
        --validate_params \
        --max_cpus 8 \
        --max_memory 84.GB \
        --max_time 500h \
        --skip_barrnap \
        --skip_fastqc \
        -resume

I also don't see multiple jobs running at the same time. The only related parameters I see listed in the log file are --max_cpus, --max_memory and --max_time.
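Presumably, since no executor is configured, Nextflow runs every process with the local executor inside this single sbatch allocation, which would explain why no separate jobs show up. A minimal, untested sketch of letting it submit each process as its own SLURM job instead:

# Untested sketch: append an executor setting to the custom config so that
# Nextflow submits each process as its own SLURM job.
cat >> gg2.config <<'EOF'
process.executor = 'slurm'
EOF

With that, the sbatch script would only need to host the Nextflow head job, and the per-process cpus/memory/time requests would go to the scheduler.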

The supplied config file (gg2.config) is:

process {
    withName:QIIME2_EXTRACT {
        cpus   = 8
        memory = 12.GB
        time   = 500.h
    }
}

N E X T F L O W ~ version 23.04.3
nf-core/ampliseq v2.10.0

@aimirza
Author

aimirza commented Sep 7, 2024

I think I got it to work after increasing the number of CPUs, but now I have another problem. Apparently it is running out of space ("[Errno 28] No space left on device") when running QIIME2_PREPTAX:QIIME2_TRAIN, even though I have 3 TB left on my device. Any idea what the issue is?

@d4straub
Copy link
Collaborator

d4straub commented Sep 9, 2024

Hi there, this is going way out of the scope of this issue (adding the gg2 database). Your problems are not related to gg2 but to executing a large job on your HPC. The out-of-space error is most likely related to your HPC's settings for tmp/scratch data; please contact your sysadmin.

@aimirza
Author

aimirza commented Sep 9, 2024

I need to know a couple of things about using the gg2 database. When running the QIIME2_TRAIN process on the gg2 database, which is a resource-intensive job running on 1 CPU, what is the minimum memory it requires? Second, where are the tmp files stored when running QIIME2_TRAIN? The log says "Debug info has been saved to /tmp/qiime2-q2cli-err-cegyux3s.log", but no such file exists in that directory, nor is it in the tmp directory TMPDIR I defined before running the pipeline.

@aimirza
Author

aimirza commented Sep 10, 2024

To reduce memory usage, I will add the parameter --p-classify--chunk-size 10000 (default 20000) to the qiime feature-classifier fit-classifier-naive-bayes command in the modules/local/qiime2_train.nf module. I'll let you know if it works.
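The adjusted call would then look roughly like this (a sketch only; file names are placeholders, not the module's actual variables):

# Sketch of the training call with a reduced chunk size
# (placeholder file names, not the module's actual inputs/outputs).
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-reads.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --p-classify--chunk-size 10000 \
    --o-classifier classifier.qza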

@aimirza
Author

aimirza commented Oct 3, 2024

QIIME2_TRAIN is dumping data into the /tmp directory rather than into my specified /scratch directory. The issue is likely the /tmp folder filling up, not memory (which was set to 86.GB). Each compute node has limited local space since the nodes are primarily used for computation; most of the storage is located in our separate directories under /scratch. It's strange because we've already set the tmp folder for Nextflow to /scratch/group_share/tmp/. I've also set the following tmp directories in the script before running the pipeline:

export TMPDIR="/scratch/path/to/directory/"
export TEMP="/scratch/path/to/directory/"
export TMP="/scratch/path/to/directory/"
export QIIME2_TMPDIR="/scratch/group_share/tmp/amirza/data/"
export JOBLIB_TEMP_FOLDER="/scratch/group_share/tmp/amirza/data/"

export NXF_WORK="/scratch/group_share/nextflow_workdir/amirza"
export NXF_TEMP="/scratch/group_share/tmp/amirza/data/"
export SINGULARITY_TMPDIR="/scratch/group_share/tmp/amirza/"
export NXF_SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"
export APPTAINER_TMPDIR="/scratch/group_share/tmp/amirza/"
export APPTAINERENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITYENV_TMPDIR="/scratch/group_share/tmp/amirza/"
export SINGULARITY_CACHEDIR="/scratch/group_share/singularity_imgs/"

None of that worked, but I finally got it to work after weeks of trying, HURRAY!!

To address the issue, I created an additional configuration file with the following adjustments and passed it to the -c parameter:

Binding the /scratch Directory in Singularity:

singularity {
    runOptions = '--bind /scratch/group_share/tmp/:/scratch/group_share/tmp/'
}

This option explicitly binds the /scratch/group_share/tmp/ directory to the same path within the Singularity container. With this bind in place, any temporary files created by the QIIME 2 processes/plugins are directed to the larger storage area in /scratch rather than to the limited local /tmp directory on the compute nodes.

Setting Environment Variables for Temporary Directories:

process {
    withName: 'QIIME2_TRAIN' {
        scratch = true
        
        // Set environment variables explicitly
        env.TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.TEMP = '/scratch/group_share/tmp/amirza/data/'
        env.TMP = '/scratch/group_share/tmp/amirza/data/'
        env.QIIME2_TMPDIR = '/scratch/group_share/tmp/amirza/data/'
        env.JOBLIB_TEMP_FOLDER = '/scratch/group_share/tmp/amirza/data/'

        env.SINGULARITY_CACHEDIR = '/scratch/group_share/singularity_imgs/'
        env.APPTAINER_CACHEDIR = '/scratch/group_share/singularity_imgs/'
    }
}

These environment variables (TMPDIR, TEMP, TMP, QIIME2_TMPDIR, and JOBLIB_TEMP_FOLDER) might be used by various tools (such as qiime2 plugins) and processes to define where temporary files are stored. By setting these explicitly to /scratch/group_share/tmp/amirza/data/, I redirected the storage of temporary files from the limited /tmp directory to a designated area with sufficient space.

Additionally, setting SINGULARITY_CACHEDIR and APPTAINER_CACHEDIR ensures that the container caching mechanisms also use the allocated /scratch space, avoiding the use of local directories that might be space-constrained.

Would you know which specific changes likely fixed the problem?

With 8 CPUs of 10 GB each, I finished classifying my ASVs in 20 hours.

@d4straub
Collaborator

d4straub commented Oct 7, 2024

Thanks for detailing the solution!
Did you figure out whether singularity runOptions was needed in addition to all the TMP and CACHEDIR settings, or just the latter?

@aimirza
Author

aimirza commented Oct 8, 2024

Actually, binding Singularity to the specified directory via singularity { runOptions ... } was essential; specifying TMP and TMPDIR was not enough. I discovered that the variables (QIIME2_TMPDIR, JOBLIB_TEMP_FOLDER, etc.) were not defined inside the container, and therefore not used, by adding a check to the QIIME2_EXTRACT script that printed whether each variable was set.
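The check was along these lines (a sketch; when pasted into a Nextflow script block, the dollar signs need escaping):

# Print whether each tmp-related variable is set inside the container.
for v in TMPDIR TEMP TMP QIIME2_TMPDIR JOBLIB_TEMP_FOLDER; do
    echo "$v=${!v:-<unset>}"
done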
