Update AWK_module.md
eberdan authored Mar 28, 2024
1 parent a50d5ea commit e98d727
Showing 1 changed file with 27 additions and 2 deletions.
29 changes: 27 additions & 2 deletions Finding_and_summarizing_colossal_files/lessons/AWK_module.md
@@ -338,6 +338,29 @@ It works! We can see that "moose,bison" is the most commonly observed group of a
</details>
****

### Bioinformatic Application

Counting can be a great way to summarize annotation files (gff, gff3, gtf, etc.). This is especially true when working with new files that were generated by other people. Here is the gff we showed above, slightly edited:

```
chr3 entrez five_prime_UTR 50252100 50252137 . + . ID=UTR5:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=2;exon_id=ENSE00003567505.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3 ENSEMBL three_prime_UTR 50257691 50257714 . + . ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=8;exon_id=ENSE00003524043.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3 entrez three_prime_UTR 50258368 50259339 . + . ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=9;exon_id=ENSE00001349779.3;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3 ENSEMBL gene 50227436 50227490 . + . ID=ENSG00000275334.1;gene_id=ENSG00000275334.1;gene_type=miRNA;gene_name=MIR5787;level=3;hgnc_id=HGNC:49930
chr3 entrez gene 52560570 52560707 . + . ID=ENSG00000221518.1;gene_id=ENSG00000221518.1;gene_type=snRNA;gene_name=RNU6ATAC16P;level=3;hgnc_id=HGNC:46915
chr3 ENSEMBL transcript 52560570 52560707 . + . ID=ENST00000408591.1;Parent=ENSG00000221518.1;gene_id=ENSG00000221518.1;transcript_id=ENST00000408591.1;gene_type=snRNA;gene_name=RNU6ATAC16P;transcript_type=snRNA;transcript_name=RNU6ATAC16P-201;level=3;transcript_support_level=NA;hgnc_id=HGNC:46915;tag=basic,Ensembl_canonical
```

The second column tells us where the annotation comes from and the third column tells us what kind of feature it is. Both of these columns can be useful to summarize when you are starting to work with a new gtf file.

```bash
awk ' { counter[$2] += 1 } END { for (source in counter){ print source, counter[source] } }' my_gtf.gtf
```


```bash
awk ' { counter[$3] += 1 } END { for (feature in counter){ print feature, counter[feature] } }' my_gtf.gtf
```
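
Real annotation files are tab-delimited and often begin with `#` header lines, so a slightly more defensive version of the counting command can be handy. The sketch below is one possible variant, not part of the original lesson: it builds a tiny made-up file (the `tmp_gff` name and its contents are hypothetical), sets the field separator to a tab with `-F'\t'`, skips comment lines, and counts each source/feature pair in a single pass. Because `for (key in array)` returns keys in no guaranteed order, the output is piped through `sort`.

```bash
# Hypothetical sample data, for illustration only (not the lesson's file).
tmp_gff=$(mktemp)
printf '##gff-version 3\n'                          >  "$tmp_gff"
printf 'chr3\tentrez\tgene\t1\t10\t.\t+\t.\tID=a\n'       >> "$tmp_gff"
printf 'chr3\tentrez\tgene\t40\t50\t.\t+\t.\tID=b\n'      >> "$tmp_gff"
printf 'chr3\tENSEMBL\tgene\t20\t30\t.\t+\t.\tID=c\n'     >> "$tmp_gff"
printf 'chr3\tENSEMBL\ttranscript\t20\t30\t.\t+\t.\tID=d\n' >> "$tmp_gff"

# -F'\t'  : split fields on tabs only, so spaces inside a field do not shift columns
# !/^#/   : skip header/comment lines that start with '#'
# The array key joins column 2 (source) and column 3 (feature) with a tab.
pair_counts=$(awk -F'\t' '!/^#/ { counter[$2 "\t" $3] += 1 }
    END { for (pair in counter) { print pair "\t" counter[pair] } }' "$tmp_gff" \
    | LC_ALL=C sort)

echo "$pair_counts"

# Clean up the temporary file.
rm -f "$tmp_gff"
```

To try this on a real file, replace `"$tmp_gff"` with the path to your own annotation file and drop the `printf` and `rm` lines.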

## MFC

@@ -352,7 +375,7 @@ samtools view -S -b ${sam}.sam > ${sam}.bam
done
```

This actually combines a number of basic and intermediate shell topics such as [positional parameters](positional_params.md), [for loops](loops_and_scripts.md), and `awk`!
This actually combines a number of basic and intermediate shell topics such as [positional parameters](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/positional_params.html), [for loops](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/loops_and_scripts.html), and `awk`!

* We start with a for loop that counts from 1 to 10

@@ -391,6 +414,8 @@ Why do you think that this is MFC?

# Additional cool `awk` commands

For these commands we will return to `ecosystems.txt`.

### BEGIN

The `BEGIN` command will execute an `awk` expression once at the beginning of a command. This can be particularly useful if you want to add a header to output that doesn't already have one.
@@ -403,7 +428,7 @@ In this case we have told `awk` that we want to have "new_header" printed before

### END

Related to the `BEGIN` command, the `END` command that tells `awk` to do a command once at the end of the file. It is ***very*** useful when summing up columns (below), but we will first demonstrate how it works by adding a new record:
We already had some experience with `END` above. Related to the `BEGIN` command, the `END` command tells `awk` to run a command once at the end of the file. We will first demonstrate how it works by adding a new record:

```
awk '{print $1} END {print "new_record"}' ecosystems.txt