An effective, reproducible and reliable data analysis workflow is based on the state-of-the-art pipeline, using the most up-to-date methods. In order to facilitate my future work, building the bioinformatics pipeline is necessary. However, how to construct the workflow is still hard for me. In my view, I wanna use the nextflow or snakemake program language to do it. Before doing it, what I need is to describe the workflows in the mindmap which could make me be clear.
- amplicon sequencing analysis
- metagenomics sequencing analysis
- bulk-RNA sequencing analysis
- DNA methylation by Illumina Array
The workflow comprises of two parts, one is from raw data to profile, and the other is data analysis (statistical analysis)
- the first parts
- demultiplex sequences;
- scan the quality of reads;
- filter the low quality reads and remove host DNA sequence
- align the high quality reads into the reference database
- obtain the profile whose structure is
$M x N$ matrix (M: features' name; N: sampleid) - the second part
- statistical analysis such as wilcoxon rank sum test, LDA, PCoA, linear regression analysis and multivariables association analysis
- machine learning
First and foremost, I utilize the perl program language to do some preliminary work and finally convert all the workflows into snakemake or nextflow.