Asynchronous library and command-line utility for checksumming FASTA files and individual contigs.
Implements two checksumming algorithms, MD5 and GA4GH, to fulfill the needs of the Refget v2 API specification.
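For orientation, the following minimal sketch computes both digest styles with Python's standard library. It assumes the GA4GH digest follows the sha512t24u scheme used by Refget v2 (SHA-512, truncated to the first 24 bytes, base64url-encoded, with an SQ. prefix). The function names are illustrative and are not part of fasta-checksum-utils, and any normalization the library applies to sequence data before hashing is not reproduced here.

```python
import base64
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def ga4gh_sq_digest(data: bytes) -> str:
    """Return an 'SQ.'-prefixed sha512t24u digest (assumed GA4GH scheme)."""
    truncated = hashlib.sha512(data).digest()[:24]  # keep the first 24 bytes
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    return "SQ." + base64.urlsafe_b64encode(truncated).decode("ascii")


if __name__ == "__main__":
    sequence = b"ACGT"  # toy sequence, not real data
    print(md5_hex(sequence))
    print(ga4gh_sq_digest(sequence))
```

Truncating to 24 bytes keeps the identifier short while still making accidental collisions extremely unlikely, and the base64url alphabet makes the digest safe to embed in URLs.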
To install fasta-checksum-utils, run the following pip command:
pip install fasta-checksum-utils
To generate a text report of checksums for a FASTA file and its contigs, run the following command:
fasta-checksum-utils ./my-fasta.fa[.gz]
This will print output in the following tab-delimited format:
file [file size in bytes] md5 [file MD5 hash] ga4gh [file GA4GH hash]
chr1 [chr1 sequence length] md5 [chr1 sequence MD5 hash] ga4gh [chr1 sequence GA4GH hash]
chr2 [chr2 sequence length] md5 [chr2 sequence MD5 hash] ga4gh [chr2 sequence GA4GH hash]
...
The following example is the output generated by specifying the SARS-CoV-2 genome FASTA from NCBI:
file 30428 md5 825ab3c54b7a67ff2db55262eb532438 ga4gh SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2 29903 md5 105c82802b67521950854a851fc6eefd ga4gh SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
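The text report can be consumed from a script. The sketch below is one possible way to parse it, assuming each line holds exactly six tab-separated fields in the order shown above; ReportRow and parse_report are hypothetical names and are not provided by fasta-checksum-utils.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    name: str   # "file" for the whole file, otherwise the contig name
    size: int   # file size in bytes, or sequence length for a contig
    md5: str
    ga4gh: str


def parse_report(text: str) -> list[ReportRow]:
    """Parse the tab-delimited report into one ReportRow per non-empty line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, size, md5_label, md5, ga4gh_label, ga4gh = line.split("\t")
        if (md5_label, ga4gh_label) != ("md5", "ga4gh"):
            raise ValueError(f"unexpected column labels in line: {line!r}")
        rows.append(ReportRow(name, int(size), md5, ga4gh))
    return rows
```

The first row describes the file itself, so contig rows can be obtained with `parse_report(text)[1:]`.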
If the --out-format bento-json arguments are passed, the tool instead outputs the report as JSON, in a format designed to be compatible with the requirements of the Bento Reference Service. The following example is the output generated for the SARS-CoV-2 genome:
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.
If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.
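As an illustration of consuming the JSON report, the following sketch loads it and builds a contig-name to GA4GH-digest mapping. It assumes only the keys shown in the example output, treats "id" and "fai" as optional, and uses a hypothetical helper name (summarize_report) that is not part of fasta-checksum-utils.

```python
import json
from typing import Any


def summarize_report(report_json: str) -> dict[str, Any]:
    """Extract the optional properties and a contig-name -> GA4GH digest map."""
    report = json.loads(report_json)
    return {
        "genome_id": report.get("id"),  # None unless --genome-id was passed
        "fai": report.get("fai"),       # None unless --fai was passed
        "contigs": {c["name"]: c["ga4gh"] for c in report["contigs"]},
    }
```

Using dict.get for the optional properties lets the same helper handle reports produced with or without the --fai and --genome-id arguments.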
Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:
import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")

    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"

    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)

    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")

    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh,
            contig_name="NC_045512.2",
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())
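Because the library functions are coroutines, several files can be scheduled together with asyncio.gather. The following is a minimal sketch of that pattern: checksum_many and the injected checksum_one callable are hypothetical helpers, not part of fasta-checksum-utils, and whether the work actually overlaps depends on how the underlying I/O is implemented.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Iterable

# Hypothetical signature: takes a path, resolves to a tuple of checksum strings.
ChecksumFn = Callable[[Path], Awaitable[tuple[str, ...]]]


async def checksum_many(
    paths: Iterable[Path],
    checksum_one: ChecksumFn,
) -> dict[Path, tuple[str, ...]]:
    """Run checksum_one for every path and map each path to its result."""
    path_list = list(paths)  # materialize so the paths can be iterated twice
    # gather returns results in the same order as the awaitables it was given
    results = await asyncio.gather(*(checksum_one(p) for p in path_list))
    return dict(zip(path_list, results))
```

With the library, checksum_one could be a small wrapper such as `lambda p: fc.checksum_file(file=p, algorithms=all_algorithms)`, reusing the call shown in the example above.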