Skip to content

Commit

Permalink
add shannon entropy, redo readme
Browse files Browse the repository at this point in the history
  • Loading branch information
kdm9 committed Oct 28, 2023
1 parent 4e2db17 commit 6dd9d5f
Show file tree
Hide file tree
Showing 3 changed files with 75 additions and 18 deletions.
40 changes: 22 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# blindschleiche

Misc sequence tools in python. These tools are small things which I have at some point needed to write because I couldn't find a solution I liked. This is in no way a comprehensive toolkit. It is a companion to [seqhax](https://github.com/kdm9/seqhax), execpt that seqhax is written in C/C++ and generally contains tools to handle very large datasets where performance is somewhat important. This is all in python for ease of development, and so typically these tools perform less data- or compute-intensive tasks.
A collection of bioinformatics / sequence utilities needed for my research, and hopefully useful for yours.

[![DOI](https://zenodo.org/badge/509693094.svg)](https://zenodo.org/doi/10.5281/zenodo.10049825)

## Install

Expand All @@ -15,38 +17,40 @@ pip install blindschleiche
```
USAGE: blsl <subtool> [options...]
Where <subtool> is one of:
telogrep: Search contigs for known telomere repeats
n50: Calculate N50 and total length of a set of contigs
deepclust2fa: Split a .faa by the clusters diamond deepclust finds
ebiosra2rl2s: INTERNAL: MPI Tübingen tool. Make a runlib-to-sample map table from ebio sra files
equalbestblast: Output only the best blast hits.
esearchandfetch: Use the Entrez API to search for and download something. A CLI companion to the NCBI search box
falen: Tabulate the lengths of sequences in a FASTA file
mask2bed: The inverse of bedtools maskfasta: softmasked fasta -> unmasked fasta + mask.bed
farename: Rename sequences in a fasta file sequentially
galhist: Make a summary histogram of git-annex-list output
genigvjs: Generate a simple IGV.js visualisation of some bioinf files.
liftoff-gff3: Obtain an actually-useful GFF3 from Liftoff by fixing basic GFF3 format errors
pansn-rename: Add, remove, or modify PanSN-style prefixes to contig/chromosome names in references
gffcat: Concatenate GFF3 files, resepcting header lines and FASTA sections
gg2k: Summarise a table with GreenGenes-style lineages into a kraken-style report.
ildemux: Demultiplex modern illumina reads from read headers.
ilsample: Sample a fraction of read pairs from an interleaved fastq file
regionbed: Make a bed/region file of genome windows
uniref-acc2taxid: Make a ncbi-style acc2taxid.map file for a uniref fasta
liftoff-gff3: Obtain an actually-useful GFF3 from Liftoff by fixing basic GFF3 format errors
mask2bed: The inverse of bedtools maskfasta: softmasked fasta -> unmasked fasta + mask.bed
n50: Calculate N50 and total length of a set of contigs
nstitch: Combine R1 + R2 into single sequences, with an N in the middle
gg2k: Summarise a table with GreenGenes-style lineages into a kraken-style report.
equalbestblast: Output only the best blast hits.
pairslash: Add an old-style /1 /2 pair indicator to paired-end fastq files
pansn-rename: Add, remove, or modify PanSN-style prefixes to contig/chromosome names in references
regionbed: Make a bed/region file of genome windows
shannon-entropy: Calculate Shannon's entropy (in bits) at each column of one or more alignments
tabcat: Concatenate table (c/tsv) files, adding the filename as a column
esearchandfetch: Use the Entrez API to search for and download something. A CLI companion to the NCBI search box
deepclust2fa: Split a .faa by the clusters diamond deepclust finds
farename: Rename sequences in a fasta file sequentially
ebiosra2rl2s: INTERNAL: MPI Tübingen tool. Make a runlib-to-sample map table from ebio sra files
galhist: Make a summary histogram of git-annex-list output
telogrep: Search contigs for known telomere repeats
uniref-acc2taxid: Make a ncbi-style acc2taxid.map file for a uniref fasta
help: Print this help message
Use blsl subtool --help to get help about a specific tool
```

## Why Blindschleiche
## Why the name Blindschleiche?

1) [They're awesome animals](https://www.google.com/search?q=blindschleiche&tbm=isch)
2) Their English name is Slow Worm, which is appropriate for this set of low-performance tools in Python.
3) All tools implemented in python must be named with a snake pun, and they're kinda a snake (not really, they're legless lizards)
3) All tools implemented in Python must be named with a snake pun, and they're kinda a snake (not really, they're legless lizards)

2 changes: 2 additions & 0 deletions blsl/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,8 @@
except ImportError as exc:
print(str(exc), "-- disabling vcfstats command", file=stderr)

from .shannon import main as shannon_main
cmds["shannon-entropy"] = shannon_main

def mainhelp(argv=None):
"""Print this help message"""
Expand Down
51 changes: 51 additions & 0 deletions blsl/shannon.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
#!/usr/bin/env python3
from Bio import AlignIO
from tqdm import tqdm

import argparse
from sys import stdin, stdout, stderr
from math import log
from statistics import mean
from collections import Counter

class FreqCounter(Counter):
def freqs(self):
N = sum(self.values())
return {b:v/N for b, v in self.most_common()}

def shannon_entropy(alncol):
alncol = [b for b in alncol if b not in ("-", "*", "X")]
aafreq = FreqCounter(alncol)
entropy = -0
for aa, freq in aafreq.freqs().items():
entropy += freq*(log(freq,2))
return -entropy

def shanent_aln(alignment, alnformat="fasta"):
aln = AlignIO.read(alignment, alnformat)
if len(aln) < 15:
print(f"WARN: {alignment} has <15 seqs, treat SE values with a grain of salt", file=stderr)
se = []
for i in range(aln.get_alignment_length()):
se.append(shannon_entropy(aln[:,i]))
return aln, se

def main(argv=None):
"""Calculate Shannon's entropy (in bits) at each column of one or more alignments"""
ap = argparse.ArgumentParser()
ap.add_argument("-o", "--out-tsv", default=stdout, type=argparse.FileType("wt"),
help="Output TSV file")
ap.add_argument('alignments', nargs="+",
help='One or more MSAs, in fasta format')
args = ap.parse_args()

print("alnfile", "nseqs", "column", "shannon_entropy", sep="\t", file=args.out_tsv)
for alnfile in tqdm(args.alignments, unit=" Alignment"):
aln, se = shanent_aln(alnfile)
for i, s in enumerate(se):
print(alnfile, len(aln), i+1, s, sep="\t", file=args.out_tsv)
mean(se)
print(alnfile, len(aln), "*", mean(se), sep="\t", file=args.out_tsv)

if __name__ == '__main__':
main()

0 comments on commit 6dd9d5f

Please sign in to comment.