This repository stores scripts for the "TYMEFLIES_Viral" project, a study of the viral population based on 20-year time-series metagenome data from Lake Mendota, Madison, WI, USA (metagenomes were obtained from lake water in the pelagic integrated epilimnion zone).
Scripts (including some of the inputs/outputs) are placed in the following folders:
1 Process the datasets: Copy fastq files, calculate fastq statistics, and get the coverage statistics (or depth) files for all metagenome assemblies
2 Identify phages and find active prophages: Identify phages with VIBRANT, find active prophages with PropagAtE, and run CheckV to get phage scaffold quality
(This part relies mainly on VIBRANT, PropagAtE, and CheckV)
3 Reconstruct vMAGs: Reconstruct vMAGs using vRhyme, get the best set of phage bins using stringent criteria, run CheckV to get phage vMAG quality, and summarize AMGs for all metagenomes
(This part relies mainly on vRhyme; we also wrote a custom script to get the best set of phage bins)
4 Cluster phage genomes: Cluster vMAGs into vOTUs at family-, genus-, and species-level
[The original phage vOTU clustering methods were adopted from two previously published papers: 1) Nat Microbiol. 2021 Jul;6(7):960-970. 2) Nucleic Acids Res. 2021 Jan 8;49(D1):D764-D775. However, because of the large number of viral genomes in this study (~1.3 million), we first clustered genomes into family- and genus-level vOTUs using an MCL-based method (we also modified the original Python script to reduce the RAM demand for our case), and then used dRep to get species-level vOTUs within each genus.]
5 Taxonomic classification: Classify phage genomes using two methods: NCBI RefSeq viral protein searching and VOG HMM marker searching
(This part is mainly based on the method in Nucleic Acids Res. 2021 Jan 8;49(D1):D764-D775.)
6 Host prediction: Predict hosts using three approaches: 1) iPHoP-based prediction; 2) prophage scaffold search; 3) match to AMG (auxiliary metabolic gene)
7 Rscript for visualization: R scripts for a variety of visualizations
8 Mapping metagenomic assemblies: Map reads to metagenomic assemblies (the original scaffolds including both microbial and viral ones) to get MAG/virus abundance
9 Time series analysis - Part 1: Conduct AMG ratio and viral genome coverage analysis
10 Time series analysis - Part 2: Get coverage statistics for viral genomes containing four important AMGs
11 Time series analysis - Part 3: Get microdiversity analysis results
12 Time series analysis - Part 4: Conduct virus and MAG taxa association analysis
13 Metatranscriptome analysis: Conduct metatranscriptome analysis using different mapping references to examine gene expression patterns
14 [Miscellaneous scripts](https://github.com/AnantharamanLab/TYMEFLIES_Viral/tree/main/Miscellaneous%20scripts): Contains various auxiliary scripts that were used throughout the project
15 Environmental parameter: Contains the organized tables, original dataset sources, and scripts for parsing the original datasets
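
As a rough illustration of the "calculate fastq statistics" step in folder 1, the sketch below counts reads and bases in a FASTQ file. It is not the script used in this repository; the function names `open_fastq` and `fastq_stats` are hypothetical, and it assumes the standard four-lines-per-record FASTQ layout, either plain or gzip-compressed.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_stats(path):
    """Return read count, total bases, and mean read length.

    Assumes each record spans exactly four lines (header, sequence, '+', quality).
    """
    n_reads = 0
    n_bases = 0
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                n_reads += 1
                n_bases += len(line.strip())
    mean_length = n_bases / n_reads if n_reads else 0.0
    return {"reads": n_reads, "bases": n_bases, "mean_length": mean_length}
```

Calling `fastq_stats("sample.fastq.gz")` (with a hypothetical file name) returns a dictionary with the keys `reads`, `bases`, and `mean_length`.
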
Database processing scripts are placed in the following folders:
1 Database IMGVR: IMG/VR database v4.1, December 2022 release (for Cluster phage genomes)
2 Database NCBI RefSeq viral: NCBI RefSeq viral, 2023-01-13 release (for Taxonomic classification)
3 Database VOG97: VOG97 HMMs, released Apr 19, 2021 (for Taxonomic classification)
4 Database TYMEFLIES MAGs: MAGs from the IMG platform (for Host prediction)