Skip to content

Commit

Permalink
Update lab_5.md
Browse files Browse the repository at this point in the history
  • Loading branch information
owensgl authored Feb 2, 2024
1 parent 5599668 commit 061ba0b
Showing 1 changed file with 141 additions and 0 deletions.
141 changes: 141 additions & 0 deletions labs/lab_5.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,147 @@ ls
cds_from_genomic.fna GCF_000146045.2_R64_genomic.fna genomic.gff protein.faa rna.fna sequence_report.jsonl
```

First lets talk about "GCF_000146045.2_R64_genomic.fna". This is the genome sequence in fasta format.
It uses a .fna file suffix because it's a fasta with only nucleotide information (and not amino acids, for example).

```bash
head GCF_000146045.2_R64_genomic.fna
```
```output
>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGG
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTcccaaat
attgtataaCTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
```

Each chromosome (or contig) will be represented by a different fasta entry.
You can see that this has an ID for the first chromosome "NC_001133.9" as well
as a description of that entry. Some of the bases are lowercase, while others are
uppercase. The lowercase bases have been identified as being repetitive and are
what we call "soft-masked". Some programs will ignore soft-masked regions of the genome.

We can manipulate the reference fasta file using samtools.
```bash
#First we load the samtools module
module load samtools/1.18

#Then we need to index the fasta file.
samtools faidx GCF_000146045.2_R64_genomic.fna

#This creates the index file "GCF_000146045.2_R64_genomic.fna.fai".
#Sometimes index files are only machine-readable, but in this case the index file is actually helpful
head GCF_000146045.2_R64_genomic.fna.fai
```
```output
NC_001133.9 230218 76 80 81
NC_001134.8 813184 233249 80 81
NC_001135.5 316620 1056676 80 81
NC_001136.10 1531933 1377332 80 81
NC_001137.3 576874 2928491 80 81
NC_001138.5 270161 3512653 80 81
NC_001139.9 1090940 3786270 80 81
NC_001140.6 562643 4890926 80 81
NC_001141.2 439888 5460680 80 81
NC_001142.9 745751 5906143 80 81
```
This shows all of the contigs in the fasta file (first column) and the second column shows the number
of basepairs in each contig.

### Activity
- Sometimes you will want to extract the sequence from a specific region of the genome.
You can do this using samtools faidx. Figure out how to output the region NC_001135.5:10030-10130.
- Other times, you'll want to work on only a subset of chromosomes. Find the two longest chromosomes
and output only those two chromosomes to a separate fasta file.

***

The directory also contains information on the annotated genes in the genome. Specifically:
1) "cds_from_genomic.fna". This is the coding sequence for all genes
2) "rna.fna". This is the RNA sequence for all genes
3) "protein.faa" This is the translated amino acid sequence for all genes.
4) "genomic.gff" This includes the locations and descriptions of all features in the genome (not only genes)

The cds and rna are very similar, but the cds only includes bases that code for amino acids while
the rna includes transcribed bases upstream of the start codon and downstream of the stop codon.

Some genome assemblies have additional information files and you can find their definitions here: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

Lets take a look a the gff file.

```bash
#We're piping it to `less -S` to prevent line wrapping, because the lines on this
file can be quite long
head -n 20 genomic.gff | less -S
```
```output
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-4-1
##sequence-region NC_001133.9 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
NC_001133.9 RefSeq region 1 230218 . + . ID=NC_001133.9:1..230218;Dbxref=taxon:559292;Na>
NC_001133.9 RefSeq telomere 1 801 . - . ID=id-NC_001133.9:1..801;Dbxref=SGD:S00>
NC_001133.9 RefSeq origin_of_replication 707 776 . + . ID=id-NC_001133.9:707..776;Dbxr>
NC_001133.9 RefSeq gene 1807 2169 . - . ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;>
NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxre>
NC_001133.9 RefSeq exon 1807 2169 . - . ID=exon-NM_001180043.1-1;Parent=rna-NM_00118004>
NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=cds-NP_009332.1;Parent=rna-NM_001180043.1;Db>
NC_001133.9 RefSeq gene 2480 2707 . + . ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=YA>
NC_001133.9 RefSeq mRNA 2480 2707 . + . ID=rna-NM_001184582.1;Parent=gene-YAL067W-A;Dbx>
NC_001133.9 RefSeq exon 2480 2707 . + . ID=exon-NM_001184582.1-1;Parent=rna-NM_00118458>
NC_001133.9 RefSeq CDS 2480 2707 . + 0 ID=cds-NP_878038.1;Parent=rna-NM_001184582.1;Db>
NC_001133.9 RefSeq gene 7235 9016 . - . ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1;>
```

The start of the gff is a header, marked by the "#". It tells you which specific gff version you have and where it came from.
The first few columns tell us:
1) Chromosome or contig
2) Source of the feature
3) What type of feature
4) The start of the feature
5) The end of the feature
6) The score for the feature (in this case blank)
7) The strand the feature is on
8) The phase of the feature (in this case, what codon frame the CDS is in)
9) Additional info
You can read more about how gff format works here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

In many cases, you'll want to know which features are in a particular part of the genome. Since a gff file is
actually just text, we can use some basic unix tools to filter it. A convenient option is "awk".

```bash
awk '$1 == "NC_001133.9" && $4 >= 10290 && $5 <= 27968' genomic.gff
```
This filter for features on chromosome NC_001133.9 that start after 10290 and end before 27968.

### Activity
- Find all the features on NC_001148.4 from 1 to 10000 bp.
- Find all the origin of replications on NC_001133.9. How many are there?
- The previous awk command can sometimes miss features that overlap with your window. Here's an
improved version of the command
```bash
chromosome="NC_001133.9"
start_position=10290
end_position=27968

awk -v chrom="$chromosome" -v start="$start_position" -v end="$end_position" \
'($1 == chrom && ($4 <= end && $5 >= start)) || \
($1 == chrom && $4 >= start && $4 <= end) || \
($1 == chrom && $5 >= start && $5 <= end) || \
($1 == chrom && $4 <= start && $5 >= end)' genomic.gff
```
Generally speak, which features would this command capture that the previous command would miss?

***



Expand Down

0 comments on commit 061ba0b

Please sign in to comment.