This repository contains cached data and processing steps for day 2 of the symposium. The work is split into two major pipelines: [1] obtaining data from the BioML data set and processing it into assemblies, and [2] generating CarveMe reconstructions for all assemblies with reliable GTDB assignments.
Required compute: about 1,000 CPU hours.
The assembly steps are wrapped into a Nextflow pipeline provided with this repository: assembly.nf. A conda environment file is included to set up all required dependencies. The pipeline covers the following steps:
- Downloading the first 1000 isolate genomes from the BioML paper.
- Quality filtering and trimming with fastp.
- Assembly with MEGAHIT.
- Taxonomic placement with the GTDB toolkit (GTDB-Tk).
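A typical invocation of the pipeline might look like the sketch below. The environment file name, environment name, and Nextflow options are illustrative assumptions, not taken from the repository:

```shell
# Set up dependencies from the provided conda environment file
# (file and environment names are assumptions).
conda env create -f environment.yml
conda activate bioml-assembly

# Run the assembly pipeline; -resume picks up from cached work
# if an earlier run was interrupted.
nextflow run assembly.nf -resume
```

The `-resume` flag is useful here because the full run takes on the order of 1,000 CPU hours, so restarting from Nextflow's work cache avoids repeating completed tasks.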
After that, the data are curated by hand to remove isolates without a clear bacterial GTDB assignment. This step is contained in an RStudio notebook and leaves a little fewer than 980 assemblies.
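The curation criterion can be sketched on the GTDB-Tk summary table. The snippet below is a hypothetical filter, assuming the usual `gtdbtk.bac120.summary.tsv` layout (genome name in column 1, semicolon-separated classification in column 2); it keeps isolates classified as Bacteria with a non-empty genus. The actual notebook may use different rules:

```shell
# Minimal example input mimicking a GTDB-Tk summary table
# (contents are made up for illustration).
cat > summary_example.tsv <<'EOF'
user_genome	classification
iso_0001	d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis
iso_0002	Unclassified
iso_0003	d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__;g__;s__
EOF

# Keep genomes in domain Bacteria whose genus field is non-empty.
awk -F '\t' 'NR > 1 && $2 ~ /^d__Bacteria/ && $2 ~ /g__[^;]+;/ {print $1}' summary_example.tsv
```

Here only `iso_0001` survives: `iso_0002` has no domain assignment and `iso_0003` lacks a genus.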
The CarveMe reconstructions are generated using the Gibbons Lab model builder pipeline. The required growth media are provided in the repository as well.
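For orientation, the underlying CarveMe call per assembly might resemble the loop below. This is a hedged sketch: the directory layout, file names, and media ID (`M9`) are placeholders, and the model builder pipeline may wrap CarveMe differently:

```shell
# Build one metabolic model per protein FASTA file
# (paths and media ID are illustrative assumptions).
mkdir -p models
for faa in proteins/*.faa; do
    carve "$faa" \
        --output "models/$(basename "$faa" .faa).xml" \
        --gapfill M9 \
        --mediadb media.tsv
done
```

The `--gapfill` and `--mediadb` options tie the reconstructions to the media files shipped with the repository, so each model is gap-filled to grow on the specified medium.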