From f0b3803143bed0cb3047b20007b84bbc7d6e0558 Mon Sep 17 00:00:00 2001
From: Gregory Owens <5419829+owensgl@users.noreply.github.com>
Date: Mon, 12 Feb 2024 20:40:22 -0800
Subject: [PATCH] Update lab_6.md

---
 labs/lab_6.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/labs/lab_6.md b/labs/lab_6.md
index 3d8acf3dc..4e32e6861 100644
--- a/labs/lab_6.md
+++ b/labs/lab_6.md
@@ -55,13 +55,86 @@ ls
 ```output
 lab6_data.tar.gz  data
 ```
-The data files we need 
+The files we need are in the data directory, lets take a look.
+
+```bash
+cd data
+ls
+```
+
+```output
+SalmonReference.fasta                           SalmonSim.Stabilising.p10.i2.80000_R1.fastq.gz
+SalmonSim.Stabilising.p10.i1.80000_R1.fastq.gz  SalmonSim.Stabilising.p10.i2.80000_R2.fastq.gz
+SalmonSim.Stabilising.p10.i1.80000_R2.fastq.gz
+```
+
+Here we have a miniature version of the chinook salmon reference genome, named "SalmonReference.fasta". 
+We also have sequencing data for two samples, SalmonSim.Stabilising.p10.i1.80000 and SalmonSim.Stabilising.p10.i2.80000.
+Each of them have both forward and reverse reads. This name is fairly long and complicated because of 
+how I generated this sequence data, but this sort of complicated naming is often what you'll see. It's
+better to have a long and descriptive name (e.g. "SpeciesX.Pop102.Sample53.Tissue39"), than a short non-descriptive
+one ( e.g. "Sample_A"). When plotting the data, you can switch to easier to read sample names if necessary.
+
+Before we can start aligning, we need to index our reference genome. Last week we indexed our reference
+genomes using samtools. The alignment program we're using also needs to index the reference genome. 
+
+```bash
+module load StdEnv/2023
+module load samtools/1.18
+module load bwa-mem2/2.2.1
+
+samtools faidx SalmonReference.fasta
+bwa-mem2 index SalmonReference.fasta
+```
+
+We now have index files for our reference genome. Take a look at them. In contrast to the .fai file we worked with 
+last lab, these files are not actually useful to us on their own. They're necessary for the alignment 
+program, but they're only machine readable. Regardless, lets align our first sample!
+
+```bash
+bwa-mem2 mem -t 2 SalmonReference.fasta SalmonSim.Stabilising.p10.i1.80000_R1.fastq.gz SalmonSim.Stabilising.p10.i1.80000_R2.fastq.gz > SalmonSim.Stabilising.p10.i1.80000.sam
+```
+
+This will output a lot of text as it progresses. For most projects, this step will take minutes or hours,
+but here we're working with a small dataset of simulated reads from a small genome, so it finishes
+immediately. 
+
+We now have our data in "sam" format. This stands for Sequence Alignment/Map format. In the sam file,
+we have our base calls for each read, along with the quality score. This means a sam file has all the information
+in our original fastq format, plus additional info! Specifically, the sam file contains information about
+whether and where each read has mapped in the genome. 
+
+```bash
+head SalmonSim.Stabilising.p10.i1.80000.sam | less -S
+```
+
+```output
+@SQ     SN:chr_1        LN:5000000
+@SQ     SN:chr_2        LN:5000000
+@PG     ID:bwa-mem2     PN:bwa-mem2     VN:2.2.1        CL:bwa-mem2 mem -t 2 SalmonReference.fasta SalmonSim.Stabi>
+chr_1_1_0_0     99      chr_1   4629813 60      126M    =       4630544 857     CTCTGCCATGCTGTGTTGTCATGTGTTACTGTCA>
+chr_1_1_0_0     147     chr_1   4630544 60      126M    =       4629813 -857    TGGGGTCCAATAAAACCTAGTAGTATCTTATATT>
+chr_1_1_1_0     99      chr_1   3307901 60      126M    =       3308085 310     TTCACTATTTTGTGGTAGCTCTGAACTTGCTCAC>
+chr_1_1_1_0     147     chr_1   3308085 60      126M    =       3307901 -310    TGCTTGACGGATTACTTGGGGGGGGGGTTGACAC>
+chr_1_1_2_0     99      chr_1   2893195 60      126M    =       2894284 1215    CACCTGCCTTGTATTTCTCAGATTTCTATCTGAG>
+chr_1_1_2_0     147     chr_1   2894284 60      126M    =       2893195 -1215   TTGAGGCCTCATCTTCACTTTCACTTTCCAACCT>
+chr_1_1_3_0     99      chr_1   4532319 60      126M    =       4532871 678     TCAGTGCCTCTTTACTTGACATCATGGGAAAATC>
+```
+
+The first two lines are headers, which list all the contigs in the reference genome, along with their size. After that,
+we have more header lines that specify all the programs and commands used to create this file. In this case,
+it has the bwa-mem2 call (and specific information about the flags used for that command call). As we 
+modify the sam file, more lines will get added to the header recording the steps we did. This
+is very helpful if you forget what you did exactly. 
+
+
+
+
+
+
+
 
 
-#Download some data
-#Download a reference genome
-#Index your reference genome
-#Align data
 #Look around bam file
 #Sort bam file.
 #Extract single region of bam file