Skip to content

Commit

Permalink
Update lab_7.md
Browse files Browse the repository at this point in the history
  • Loading branch information
owensgl authored Feb 25, 2024
1 parent 09ca166 commit 87ac806
Showing 1 changed file with 81 additions and 0 deletions.
81 changes: 81 additions & 0 deletions labs/lab_7.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,88 @@ cd ../reference
ln -s /project/ctb-grego/sharing/lab_7/reference/* ./
```

We're first going to use bcftools mpileup to call genotypes. We need to create a file that tells bcftools
where all the bam files are and what their names are. We can do that using shell commands.
```bash
cd ~/lab_7
ls bam | grep -v ".bai" | sed 's/^/bam\//g' > bamlist.txt
```
### Question
1. Break down the command used to create bamlist.txt and explain what step is doing.


Now we're going to call the first step, bcftools mpileup.

```bash
module load StdEnv/2023 gcc/12.3 bcftools/1.18

#We're using the -q 20 option here to require that reads have mapq >= 20.
bcftools mpileup -q 20 -f reference/SalmonReference.fasta -b bamlist.txt > lab_7.gvcf
```

The output of this is like a vcf file but it includes information on all sites in the genome,
but hasn't actually called genotypes yet. This is an intermediate file format before we create
the final vcf. Lets take a look.

```bash
less lab_7.gvcf
```
```output
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.18+htslib-1.18
##bcftoolsCommand=mpileup -q 20 -f reference/SalmonReference.fasta -b bamlist.txt
##reference=file://reference/SalmonReference.fasta
##contig=<ID=chr_1,length=5000000>
##contig=<ID=chr_2,length=5000000>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric, http://samtools.github.io/bcftools/rd-SegBias.pdf">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bam/sample_0.bam bam/sample_1.bam bam/sample_2.bam bam/sample_3.bam ba>
chr_1 1 . T <*> 0 . DP=1;I16=1,0,0,0,33,1089,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,33 0,0,0 0,>
chr_1 2 . G <*> 0 . DP=1;I16=1,0,0,0,14,196,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,14 0,0,0 0,>
chr_1 3 . G <*> 0 . DP=1;I16=1,0,0,0,33,1089,0,0,60,3600,0,0,2,4,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,33 0,0,0 0,>
chr_1 4 . G <*> 0 . DP=1;I16=1,0,0,0,33,1089,0,0,60,3600,0,0,3,9,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,33 0,0,0 0,>
chr_1 5 . C <*> 0 . DP=1;I16=1,0,0,0,33,1089,0,0,60,3600,0,0,4,16,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,33 0,0,0 0,>
chr_1 6 . A <*> 0 . DP=1;I16=1,0,0,0,37,1369,0,0,60,3600,0,0,5,25,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,37 0,0,0 0,>
chr_1 7 . A <*> 0 . DP=1;I16=1,0,0,0,37,1369,0,0,60,3600,0,0,6,36,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,37 0,0,0 0,>
chr_1 8 . G <*> 0 . DP=1;I16=1,0,0,0,37,1369,0,0,60,3600,0,0,7,49,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,37 0,0,0 0,>
chr_1 9 . G <*> 0 . DP=1;I16=1,0,0,0,37,1369,0,0,60,3600,0,0,8,64,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,37 0,0,0 0,>
chr_1 10 . C <*> 0 . DP=1;I16=1,0,0,0,37,1369,0,0,60,3600,0,0,9,81,0,0;QS=1,0;MQ0F=0 PL 0,0,0 0,0,0 0,0,0 0,3,37 0,0,0 0,>
```

The file has a header denotated by the ## at the start of each line. This tells you what the abbreviations
used in the file mean. Then there's the line that starts with #CHROM, which is the column names for the rest
of the file. We have 9 information columns and then 1 column per sample. Each row after that is for a single
position in the genome. The most important parts of this file are:
- The first column, which is the chromosome.
- The second column, which is the position on that chromosome.
- The fourth column, which is the reference allele (i.e. the allele in the reference genome).
- The fifth column, which is the alternate allele(s). If there is no alternate allele, it is
labeled as <*>.
- The DP value in the info column. This tells you hany reads there is total for this site.
- The 10th column onward shows the genotype likelihoods for each sample. The three values, separated
by ",", are the phred scaled likelihoods for 0/0, 0/1 and 1/1 respectively. A 0 phred scaled likelihood,
is the most likely. Higher values are less likely. Ultimately, the program is going to pick a single
genotype value based on comparing these likelihoods.





bcftools mpileup -q 20 -d 200 -f Sockeye/ReferenceGenome/SalmonReference.fasta -b bamlist.txt | bcftools call -mv > Biol470.vcf



0 comments on commit 87ac806

Please sign in to comment.