From 82e11e2aeb84794fcd8edab93e11f5e9a6964d6c Mon Sep 17 00:00:00 2001 From: Gregory Owens <5419829+owensgl@users.noreply.github.com> Date: Tue, 13 Feb 2024 15:43:24 -0800 Subject: [PATCH] Update lab_6.md --- labs/lab_6.md | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) diff --git a/labs/lab_6.md b/labs/lab_6.md index a82534e46..eb2ff6c04 100644 --- a/labs/lab_6.md +++ b/labs/lab_6.md @@ -225,6 +225,134 @@ Depth indicates how many reads are on each base (on average) and coverage tells you how much of the reference sequence is covered by at least one base. +### Questions +1) Use samtools stats to find out how many reads had MAPQ less than 10 in sample "SalmonSim.Stabilising.p10.i1.80000". +2) Use samtools flagstats to find out what percent of reads are mapped to the genome. + + +We can also look at the alignment itself by using samtools tview. +```bash +samtools tview SalmonSim.Stabilising.p10.i1.80000.sort.markdup.bam SalmonReference.fasta +``` +Once you're inside, use the "?" key to bring up the help menu. Then use keys to move +and look at your alignment. For this sample, we have very little differences between +our sample and the reference, but when we start doing variant calling we can use +this to actually look at the reads and make sure that the variants are real. + +*** + +We often sequence many samples, and each of them needs to be processed individually. +We could make one long script that does each step for each sample individually, +but it's much more manageable to use loops to repeat the same steps for +many different samples. Lets go through a toy example to learn some tricks. + +First lets start with a list of fastq files. You can make this file in nano by +copying and pasting and name it "fastq_files.txt". You could imagine making +this file using the "ls" command with real data. +```bash +cat fastq_files.txt +``` +```output +sample01_R1.fastq.gz +sample01_R2.fastq.gz +sample02_R1.fastq.gz +sample02_R2.fastq.gz +sample29_R1.fastq.gz +sample29_R2.fastq.gz +``` + +First thing is that we have both R1 and R2 files, but we're going to be treating +those together as a single sample. So lets filter out the R2 lines using "grep" + +```bash +cat fastq_files.txt | grep -v R2.fastq.gz > fastq_files.R1.txt +``` +Now we have a list of files that we want to work on. Here are two ways +we can loop through each line in that file. + +Option 1: +```bash +for x in `cat fastq_files.R1.txt` +do + echo $x +done +``` +In this option, we're doing a creating our list of $x variables by using +a mini command call in the `` marks. We could modify it on the fly with additional filters. +For example +Option 1a: +```bash +for x in `cat fastq_files.R1.txt | grep -v sample29` +do + echo $x +done +``` +Here we've excluded sample29. + +In the second option, we use a "while" loop and the "read" command. +The file that it is reading is specified at the end of the while loop. +Option 2: +```bash +while read x; do + echo $x +done < fastq_files.R1.txt +``` + +A second challenge when making shell scripts is that we often want to +make a series of files with similar names but different suffixes. In this case +we start off with "sample29_R1.fastq.gz". The bam file we create from this should +be sample29.bam, not sample29_R1.fastq.gz.bam. + +Here are two options for removing suffixes. +Option 1: +```bash +x=sample01_R1.fastq.gz +y=$(basename $x _R1.fastq.gz) +echo $y +``` +The "basename" removes a suffix that you specify for a string. It +can also function to remove directories before a filename. +```bash +x=fastq/data/sample01_R1.fastq.gz +y=$(basename $x) +echo $y +``` + +A more general approach is to use sed (a.k.a. string editor) +to modify a variable. Sed can be used to find and substitute +any character pattern in a string. It works like this: +s/character to find/character to replace it with/g +The "s" at the start is for subtitution. +The "g" at the end is for global (which means it finds all instances +of the replacement, not just the first). + +Take a look at these examples and see if you understand how they work: +```bash +echo "cats" | sed s/a/o/g + +echo "cats" | sed s/a//g + +echo "cats" | sed s/a/aaaaaa/g + +echo "cats" | sed s/ats/ats_are_funky/g +``` +We can use this to rename variables. +```bash +x=sample01_R1.fastq.gz +y=$(echo $x | sed s/_R1.fastq.gz//g) +echo $y +``` + + + + + + + + + + +