From f0b3803143bed0cb3047b20007b84bbc7d6e0558 Mon Sep 17 00:00:00 2001 From: Gregory Owens <5419829+owensgl@users.noreply.github.com> Date: Mon, 12 Feb 2024 20:40:22 -0800 Subject: [PATCH] Update lab_6.md --- labs/lab_6.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 78 insertions(+), 5 deletions(-) diff --git a/labs/lab_6.md b/labs/lab_6.md index 3d8acf3dc..4e32e6861 100644 --- a/labs/lab_6.md +++ b/labs/lab_6.md @@ -55,13 +55,86 @@ ls ```output lab6_data.tar.gz data ``` -The data files we need +The files we need are in the data directory, lets take a look. + +```bash +cd data +ls +``` + +```output +SalmonReference.fasta SalmonSim.Stabilising.p10.i2.80000_R1.fastq.gz +SalmonSim.Stabilising.p10.i1.80000_R1.fastq.gz SalmonSim.Stabilising.p10.i2.80000_R2.fastq.gz +SalmonSim.Stabilising.p10.i1.80000_R2.fastq.gz +``` + +Here we have a miniature version of the chinook salmon reference genome, named "SalmonReference.fasta". +We also have sequencing data for two samples, SalmonSim.Stabilising.p10.i1.80000 and SalmonSim.Stabilising.p10.i2.80000. +Each of them have both forward and reverse reads. This name is fairly long and complicated because of +how I generated this sequence data, but this sort of complicated naming is often what you'll see. It's +better to have a long and descriptive name (e.g. "SpeciesX.Pop102.Sample53.Tissue39"), than a short non-descriptive +one ( e.g. "Sample_A"). When plotting the data, you can switch to easier to read sample names if necessary. + +Before we can start aligning, we need to index our reference genome. Last week we indexed our reference +genomes using samtools. The alignment program we're using also needs to index the reference genome. + +```bash +module load StdEnv/2023 +module load samtools/1.18 +module load bwa-mem2/2.2.1 + +samtools faidx SalmonReference.fasta +bwa-mem2 index SalmonReference.fasta +``` + +We now have index files for our reference genome. Take a look at them. In contrast to the .fai file we worked with +last lab, these files are not actually useful to us on their own. They're necessary for the alignment +program, but they're only machine readable. Regardless, lets align our first sample! + +```bash +bwa-mem2 mem -t 2 SalmonReference.fasta SalmonSim.Stabilising.p10.i1.80000_R1.fastq.gz SalmonSim.Stabilising.p10.i1.80000_R2.fastq.gz > SalmonSim.Stabilising.p10.i1.80000.sam +``` + +This will output a lot of text as it progresses. For most projects, this step will take minutes or hours, +but here we're working with a small dataset of simulated reads from a small genome, so it finishes +immediately. + +We now have our data in "sam" format. This stands for Sequence Alignment/Map format. In the sam file, +we have our base calls for each read, along with the quality score. This means a sam file has all the information +in our original fastq format, plus additional info! Specifically, the sam file contains information about +whether and where each read has mapped in the genome. + +```bash +head SalmonSim.Stabilising.p10.i1.80000.sam | less -S +``` + +```output +@SQ SN:chr_1 LN:5000000 +@SQ SN:chr_2 LN:5000000 +@PG ID:bwa-mem2 PN:bwa-mem2 VN:2.2.1 CL:bwa-mem2 mem -t 2 SalmonReference.fasta SalmonSim.Stabi> +chr_1_1_0_0 99 chr_1 4629813 60 126M = 4630544 857 CTCTGCCATGCTGTGTTGTCATGTGTTACTGTCA> +chr_1_1_0_0 147 chr_1 4630544 60 126M = 4629813 -857 TGGGGTCCAATAAAACCTAGTAGTATCTTATATT> +chr_1_1_1_0 99 chr_1 3307901 60 126M = 3308085 310 TTCACTATTTTGTGGTAGCTCTGAACTTGCTCAC> +chr_1_1_1_0 147 chr_1 3308085 60 126M = 3307901 -310 TGCTTGACGGATTACTTGGGGGGGGGGTTGACAC> +chr_1_1_2_0 99 chr_1 2893195 60 126M = 2894284 1215 CACCTGCCTTGTATTTCTCAGATTTCTATCTGAG> +chr_1_1_2_0 147 chr_1 2894284 60 126M = 2893195 -1215 TTGAGGCCTCATCTTCACTTTCACTTTCCAACCT> +chr_1_1_3_0 99 chr_1 4532319 60 126M = 4532871 678 TCAGTGCCTCTTTACTTGACATCATGGGAAAATC> +``` + +The first two lines are headers, which list all the contigs in the reference genome, along with their size. After that, +we have more header lines that specify all the programs and commands used to create this file. In this case, +it has the bwa-mem2 call (and specific information about the flags used for that command call). As we +modify the sam file, more lines will get added to the header recording the steps we did. This +is very helpful if you forget what you did exactly. + + + + + + + -#Download some data -#Download a reference genome -#Index your reference genome -#Align data #Look around bam file #Sort bam file. #Extract single region of bam file