PopGen.awk

Collection of AWK scripts for population & evolutionary genomics. Maybe a library later.

Motivation

Even though existing tools sometimes claim to do what I need, I often find they do it incorrectly, or just not the way I need. Since I like coding, I started writing my own scripts in AWK, a simple scripting language designed to process structured textual data, which is exactly the kind we see in genomics. As the scripts grew in number, managing them in separate gists became clumsy, so I made this repository to keep them in one place and to make code reuse easier. Besides AWK, some R, Miller, or shell may show up here too.

Dependencies

Most of the scripts here can be used with any modern awk implementation; if in doubt, GNU awk (aka gawk) should work well. GNU awk has the most features and directly supports libraries (although library support can be coded with other awks too).
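As a minimal sketch of library-style reuse that does not depend on gawk, most awks accept several `-f` options and load the files in order, so shared functions can live in their own file. The file names `lib.awk` and `main.awk` and the function `gc_content` are hypothetical and are not taken from this repository.

```shell
# Work in a throwaway directory so the example leaves nothing behind.
tmpdir=$(mktemp -d)

# Hypothetical library file: one reusable function.
cat > "$tmpdir/lib.awk" <<'EOF'
# Fraction of G/C characters in a sequence string (0 for an empty string).
function gc_content(seq,    len, gc) {
  len = length(seq)
  if (len == 0) return 0
  gc = gsub(/[GCgc]/, "", seq)   # gsub returns the number of replacements
  return gc / len
}
EOF

# Hypothetical main script: expects "name sequence" on each line.
cat > "$tmpdir/main.awk" <<'EOF'
{ print $1, gc_content($2) }
EOF

# Both files are loaded into one program; the library comes first by convention.
result=$(printf 'seq1 GATC\nseq2 GGCC\n' | awk -f "$tmpdir/lib.awk" -f "$tmpdir/main.awk")
echo "$result"

rm -r "$tmpdir"
```

With gawk, the same library could instead be pulled in with an `@include` directive or the `-i` option, which is the direct library support mentioned above.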

However, some scripts may require a particular interpreter (such as gawk or bioawk). In such cases I note the requirement in the script's shebang and file extension (e.g. .gawk or .bioawk). I also use mawk for speed when I can.
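The extension convention can be read mechanically. The sketch below is a hypothetical shell helper, not part of this repository, that maps a script name to the interpreter its extension implies; it assumes plain `.awk` files run with whatever `awk` is on the path, and the script names are made up.

```shell
# Hypothetical helper: infer the required interpreter from the file extension.
interpreter_for() {
  case "$1" in
    *.gawk)   echo gawk ;;     # needs GNU awk features
    *.bioawk) echo bioawk ;;   # needs bioawk
    *.awk)    echo awk ;;      # assumed to run with any modern awk
    *)        echo unknown ;;
  esac
}

# Made-up script names, for illustration only.
interpreter_for example.gawk
interpreter_for example.bioawk
interpreter_for example.awk
```

The shebang remains the authoritative statement of the requirement when a script is executed directly; the extension mainly helps when browsing the files.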

All the AWK interpreters used in this repository can be easily installed with package managers like conda or brew. Moreover, I often use bcftools query to convert raw VCF files into a tabular format suitable for many of the scripts.
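To illustrate the VCF-to-table step, here is a minimal sketch of an AWK filter that consumes the kind of tab-separated table `bcftools query` can produce. The column layout, the function name `alt_freq`, and the sample rows are assumptions made for this example; they do not describe any particular script in this repository.

```shell
# Hypothetical helper: per-site non-reference allele frequency from tabular genotypes.
# Assumed columns: CHROM, POS, REF, ALT, then one GT column per sample, e.g. from
#   bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input.vcf.gz
# (input.vcf.gz is a placeholder file name).
alt_freq() {
  awk -F'\t' -v OFS='\t' '
  {
    alt = 0; called = 0
    for (i = 5; i <= NF; i++) {
      # Split phased (|) and unphased (/) genotypes into single alleles.
      n = split($i, allele, "[/|]")
      for (j = 1; j <= n; j++) {
        if (allele[j] == ".") continue    # missing allele, not counted
        called++
        if (allele[j] != "0") alt++       # any non-reference allele
      }
    }
    # Output: CHROM, POS, alt count, called alleles, frequency.
    if (called > 0) print $1, $2, alt, called, alt / called
  }'
}

# Made-up rows standing in for real bcftools output.
result=$(printf 'chr1\t100\tA\tG\t0/0\t0/1\t1/1\nchr1\t200\tC\tT\t0|1\t./.\t0|0\n' | alt_freq)
echo "$result"
```

In real use the `printf` would be replaced by the `bcftools query` command piped straight into the filter.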

For example, you can use the following conda command to install all the necessary dependencies in a new environment:

conda create --name popgen-awk --channel conda-forge --channel bioconda gawk mawk=1.3.4 bioawk bcftools miller # r-base r-seqinr r-ape
conda activate popgen-awk
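After activating the environment, a quick check like the following sketch can confirm that the tools are on the path. The function `missing_tools` is a hypothetical helper written for this example; note that the Miller package installs its executable as `mlr`.

```shell
# Hypothetical helper: print the name of every tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executables expected from the packages in the conda command above.
missing=$(missing_tools gawk mawk bioawk bcftools mlr)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All tools found."
fi
```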