add shannon entropy, redo readme

kdm9 · Oct 28, 2023 · 6dd9d5f · 6dd9d5f
1 parent 4e2db17
commit 6dd9d5f
Show file tree

Hide file tree

Showing 3 changed files with 75 additions and 18 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,8 @@
 # blindschleiche
 
-Misc sequence tools in python. These tools are small things which I have at some point needed to write because I couldn't find a solution I liked. This is in no way a comprehensive toolkit. It is a companion to [seqhax](https://github.com/kdm9/seqhax), execpt that seqhax is written in C/C++ and generally contains tools to handle very large datasets where performance is somewhat important. This is all in python for ease of development, and so typically these tools perform less data- or compute-intensive tasks.
+A collection of bioinformatics / sequence utilities needed for my research, and hopefully useful for yours.
+
+[![DOI](https://zenodo.org/badge/509693094.svg)](https://zenodo.org/doi/10.5281/zenodo.10049825)
 
 ## Install
 
@@ -15,38 +17,40 @@ pip install blindschleiche
 ```
 USAGE: blsl <subtool> [options...]
 
-
 Where <subtool> is one of:
 
-  telogrep:             Search contigs for known telomere repeats
-  n50:                  Calculate N50 and total length of a set of contigs
+  deepclust2fa:         Split a .faa by the clusters diamond deepclust finds
+  ebiosra2rl2s:         INTERNAL: MPI Tübingen tool. Make a runlib-to-sample map table from ebio sra files
+  equalbestblast:       Output only the best blast hits.
+  esearchandfetch:      Use the Entrez API to search for and download something. A CLI companion to the NCBI search box
   falen:                Tabulate the lengths of sequences in a FASTA file
-  mask2bed:             The inverse of bedtools maskfasta: softmasked fasta -> unmasked fasta + mask.bed
+  farename:             Rename sequences in a fasta file sequentially
+  galhist:              Make a summary histogram of git-annex-list output
   genigvjs:             Generate a simple IGV.js visualisation of some bioinf files.
-  liftoff-gff3:         Obtain an actually-useful GFF3 from Liftoff by fixing basic GFF3 format errors
-  pansn-rename:         Add, remove, or modify PanSN-style prefixes to contig/chromosome names in references
+  gffcat:               Concatenate GFF3 files, resepcting header lines and FASTA sections
+  gg2k:                 Summarise a table with GreenGenes-style lineages into a kraken-style report.
   ildemux:              Demultiplex modern illumina reads from read headers.
   ilsample:             Sample a fraction of read pairs from an interleaved fastq file
-  regionbed:            Make a bed/region file of genome windows
-  uniref-acc2taxid:     Make a ncbi-style acc2taxid.map file for a uniref fasta
+  liftoff-gff3:         Obtain an actually-useful GFF3 from Liftoff by fixing basic GFF3 format errors
+  mask2bed:             The inverse of bedtools maskfasta: softmasked fasta -> unmasked fasta + mask.bed
+  n50:                  Calculate N50 and total length of a set of contigs
   nstitch:              Combine R1 + R2 into single sequences, with an N in the middle
-  gg2k:                 Summarise a table with GreenGenes-style lineages into a kraken-style report.
-  equalbestblast:       Output only the best blast hits.
+  pairslash:            Add an old-style /1 /2 pair indicator to paired-end fastq files
+  pansn-rename:         Add, remove, or modify PanSN-style prefixes to contig/chromosome names in references
+  regionbed:            Make a bed/region file of genome windows
+  shannon-entropy:      Calculate Shannon's entropy (in bits) at each column of one or more alignments
   tabcat:               Concatenate table (c/tsv) files, adding the filename as a column
-  esearchandfetch:      Use the Entrez API to search for and download something. A CLI companion to the NCBI search box
-  deepclust2fa:         Split a .faa by the clusters diamond deepclust finds
-  farename:             Rename sequences in a fasta file sequentially
-  ebiosra2rl2s:         INTERNAL: MPI Tübingen tool. Make a runlib-to-sample map table from ebio sra files
-  galhist:              Make a summary histogram of git-annex-list output
+  telogrep:             Search contigs for known telomere repeats
+  uniref-acc2taxid:     Make a ncbi-style acc2taxid.map file for a uniref fasta
   help:                 Print this help message
 
 
 Use blsl subtool --help to get help about a specific tool
 ```
 
-## Why Blindschleiche
+## Why the name Blindschleiche?
 
 1) [They're awesome animals](https://www.google.com/search?q=blindschleiche&tbm=isch)
 2) Their English name is Slow Worm, which is appropriate for this set of low-performance tools in Python.
-3) All tools implemented in python must be named with a snake pun, and they're kinda a snake (not really, they're legless lizards)
+3) All tools implemented in Python must be named with a snake pun, and they're kinda a snake (not really, they're legless lizards)
 
diff --git a/blsl/__init__.py b/blsl/__init__.py
@@ -83,6 +83,8 @@
 except ImportError as exc:
     print(str(exc), "-- disabling vcfstats command", file=stderr)
 
+from .shannon import main as shannon_main
+cmds["shannon-entropy"] = shannon_main
 
 def mainhelp(argv=None):
     """Print this help message"""

diff --git a/blsl/shannon.py b/blsl/shannon.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+from Bio import AlignIO
+from tqdm import tqdm
+
+import argparse
+from sys import stdin, stdout, stderr
+from math import log
+from statistics import mean
+from collections import Counter
+
+class FreqCounter(Counter):
+    def freqs(self):
+        N = sum(self.values())
+        return {b:v/N for b, v in self.most_common()}
+
+def shannon_entropy(alncol):
+    alncol = [b for b in alncol if b not in ("-", "*", "X")]
+    aafreq = FreqCounter(alncol)
+    entropy = -0
+    for aa, freq in aafreq.freqs().items():
+        entropy += freq*(log(freq,2))
+    return -entropy
+
+def shanent_aln(alignment, alnformat="fasta"):
+    aln = AlignIO.read(alignment, alnformat)
+    if len(aln) < 15:
+        print(f"WARN: {alignment} has <15 seqs, treat SE values with a grain of salt", file=stderr)
+    se = []
+    for i in range(aln.get_alignment_length()):
+        se.append(shannon_entropy(aln[:,i]))
+    return aln, se
+
+def main(argv=None):
+    """Calculate Shannon's entropy (in bits) at each column of one or more alignments"""
+    ap = argparse.ArgumentParser()
+    ap.add_argument("-o", "--out-tsv", default=stdout, type=argparse.FileType("wt"),
+            help="Output TSV file")
+    ap.add_argument('alignments', nargs="+",
+            help='One or more MSAs, in fasta format')
+    args = ap.parse_args()
+
+    print("alnfile", "nseqs",  "column", "shannon_entropy", sep="\t", file=args.out_tsv)
+    for alnfile in tqdm(args.alignments, unit=" Alignment"):
+        aln, se = shanent_aln(alnfile)
+        for i, s in enumerate(se):
+            print(alnfile, len(aln), i+1, s, sep="\t", file=args.out_tsv)
+        mean(se)
+        print(alnfile, len(aln), "*", mean(se), sep="\t", file=args.out_tsv)
+
+if __name__ == '__main__':
+    main()