Snakemake pipeline to assemble Oxford Nanopore Technology reads. This pipeline was originally applied to P. aeruginosa isolates, however the pipeline will also work for other isolates. You will need Guppy demultiplexed and basecalled FASTQ files to begin.
By default, this workflow returns Homopolished assemblies. Isolates with long and short read data can also be assembled into Polypolished assemblies.
Prior to polishing with Homopolish or Polypolish, this workflow:
- Filters reads using Filtlong;
- Assembles filtered reads using Flye;
- Polishes the Flye assembly using Medaka.
If using Polypolish, short reads are filtered using Fastp.
In order to start, clone this repository to your local environment and navigate into the new directory:
git clone https://github.com/ezherman/assemble-cf-isolates
cd assemble-cf-isolates
You will need a Snakemake conda environment to begin. To speed up the installation of conda environments,
I recommend including mamba
in this environment:
conda create -c conda-forge -c bioconda -n snakemake snakemake mamba -y
conda activate snakemake
When running the workflow for the first time, Snakemake takes care of creating the required conda environments (with the exception of one environment, see below). You can ask Snakemake to specifically set up the conda environments with the command below. This way, the duration of the first execution of the workflow will not be extended by conda environment installations:
snakemake -j1 --use-conda --conda-create-envs-only
Regardless of whether you run the command above, you will need to set up the Homopolish conda environment with the commands below. Homopolish cannot be installed currently by snakemake due to a dependencies issue (see here), so this is a temporary workaround:
conda config --set channel_priority flexible
conda env create -f workflow/envs/homopolish.yml
conda config --set channel_priority strict
Homopolish uses a mash sketch, which needs to be downloaded. This download may take a while, as the gzipped file is 3.9 GB. You can ask Snakemake to download the file as follows:
snakemake -j1 --use-conda download_homopolish_sketch
You will need to specify the version of Guppy that was used for basecalling.
You can edit the guppy_version
line in the config file (config/config.yaml
).
Additionally, depending on the version of Guppy that was used,
as well as the Guppy model used in basecalling, you will need to specify
a particular Medaka model. There are details on selecting the appropriate Medaka
model here. And you can
find a full list of the latest and older Medaka consensus models
here.
Once you have identified the appropriate Medaka model,
change the medaka_model
section of the config.yaml
file as appropriate.
For example, by default the model is set to r941_min_sup_g507
. This model
is appropriate for MinIon runs, with a R9.4.1 flowcell, basecalled with
Guppy >= 5.0.7 using the Super-Accuracy model. If your FASTQ files
were generated with Guppy 5.0.0 using the Fast model, then
you should change the model to r941_min_fast_g303
in the config.yaml
file.
Failure to change this setting appropriately may result in Medaka doing
a bad job at polishing your Flye-generated assemblies.
Finally, you will need to specify the model that is used by Homopolish.
By default, this is set to R9.4.pkl
. If your sequencing involved an
R10
flowcell, set the homopolish_model
parameter to R10.3.pkl
.
Each isolate should have its own directory inside data/long-read
, named
to identify that isolate (e.g. barcode02
). This
directory should contain the gzipped FASTQ file(s) for that isolate.
If you also have short-read data for this isolate, the files can be
stored in data/short-read
. For example, if five long-read FASTQ files
were associated with barcode02
, as well as two short-read FASTQ files,
then the data directory would have the following structure:
data
├── long-read
│ └── barcode02
│ ├── fastq_runid_afa204b7accb9021bb58baa5b294dc1a88ee3e9e_110_0.fastq.gz
│ ├── fastq_runid_afa204b7accb9021bb58baa5b294dc1a88ee3e9e_111_0.fastq.gz
│ ├── fastq_runid_afa204b7accb9021bb58baa5b294dc1a88ee3e9e_112_0.fastq.gz
│ ├── fastq_runid_afa204b7accb9021bb58baa5b294dc1a88ee3e9e_114_0.fastq.gz
│ └── fastq_runid_afa204b7accb9021bb58baa5b294dc1a88ee3e9e_117_0.fastq.gz
└── short-read
├── barcode02_1.fastq.gz
└── barcode02_2.fastq.gz
By default, the workflow will create Homopolished assemblies for
all isolate directories found in data/long-read
. This can be
achieved with the following command:
snakemake -j1 --use-conda
Note that you can increase -j1
to the number of cores available to you,
e.g. -j8
if you have eight available cores.
You can also explicitly specify that you want Homopolished assemblies:
snakemake -j1 --use-conda homopolish_workflow
Note that homopolish seems to require 16GB of RAM.
If you have short-read data for some or all of your isolates, you can request polypolished assemblies as follows:
snakemake -j1 --use-conda polypolish_workflow
You can obtain both homopolished and polypolished assemblies from the same Snakemake run. Note however that this does not return assemblies that have been both homopolished and polypolished. Instead, this command runs both of the workflows used above:
snakemake -j1 --use-conda homopolish_workflow polypolish_workflow
This workflow can be run on a SLURM cluster using the scheduler
and profile in the slurm
directory.
The following features need to be specified in the scheduler
script, slurm/scheduler.sh
:
--mail-user
, which will ensure that you are notified by email when the workflow ends or fails.--mail-type
and--mail-user
can be removed to disable this feature.--account
, for your project account.
The following features need to be specified in the profile config
file, slurm/slurm_profile/config.yaml
. Note that this config file
functions alongside the workflow config file, config/config.yaml
.
jobs
, which defines the maximum number of cores that snakemake can request at one time.- If you would like to be notified of the failure of individual
jobs (i.e. individual rule executions), you can add the following
to the quoted section of the
cluster
line:--mail-type=FAIL --mail-user [email protected]
.
After navigating to the slurm
directory and activating the
snakemake
conda environment (conda activate snakemake
),
run the following to execute the homopolish workflow:
sbatch scheduler_homopolish.sh
Snakemake will create a snakemake_%j.log
log file in the slurm
directory, through which you can monitor the progress of the workflow.
Log files for individual jobs will be written to the logs_slurm
directory.
To run the polypolish workflow, run the following:
sbatch scheduler_polypolish.sh
To run both workflows, run the following:
sbatch scheduler_all.sh