Dump Hash Count Pairs

Adam Taranto edited this page Sep 23, 2024 · 2 revisions

Use the .dump() method to write hash:count pairs from a KmerCountTable to a tab-delimited output file, or to return them as a list.

Example data:

import oxli

# Demo table
kct = oxli.KmerCountTable(ksize=4)
kct.count("AAAA")  # Count 'AAAA'
kct.count("TTTT")  # Count revcomp of 'AAAA'
kct.count("AATT")  # Count 'AATT'
kct.count("GGGG")  # Count 'GGGG'
kct.count("GGGG")  # Count again.

# Hashes
# 17832910516274425539 = AAAA/TTTT
#   382727017318141683 = AATT
#    73459868045630124 = GGGG

By default, dump() returns records unsorted, so their order may vary between runs.

kct.dump()
>>> [(17832910516274425539, 2), (382727017318141683, 1), (73459868045630124, 2)]

Use the sortcounts option to sort records by count, with ties broken by hash key:

kct.dump(sortcounts=True)
>>> [(382727017318141683, 1), (73459868045630124, 2), (17832910516274425539, 2)]

Use the sortkeys option to sort records by hash key:

kct.dump(sortkeys=True)
>>> [(73459868045630124, 2), (382727017318141683, 1), (17832910516274425539, 2)]
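
The two sort orders above can be reproduced in plain Python, which may help when checking or re-sorting a list that has already been dumped. This is only a sketch of the ordering implied by the example outputs (ascending counts, ascending hashes), not oxli's own implementation; the variable names are illustrative.

```python
# Records as returned by kct.dump() in the example above (order may vary).
records = [
    (17832910516274425539, 2),
    (382727017318141683, 1),
    (73459868045630124, 2),
]

# Same order as the sortcounts=True example: ascending count,
# with ties broken by ascending hash.
by_count = sorted(records, key=lambda rec: (rec[1], rec[0]))

# Same order as the sortkeys=True example: ascending hash.
by_key = sorted(records, key=lambda rec: rec[0])

print(by_count)
print(by_key)
```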

Sorted hash:count pairs can be written to a tab-delimited text file by specifying an output target:

# Write tab-delimited records to kct.dump
kct.dump(sortcounts=True, file="kct.dump")
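
A dumped file can be read back with the standard library. The sketch below assumes the file holds one record per line as hash, tab, count; that exact layout is an assumption and is not confirmed here. The helper read_dump is hypothetical and not part of oxli, and a stand-in file is written by hand so the example runs without oxli installed.

```python
import csv
import os
import tempfile


def read_dump(path):
    """Parse a two-column, tab-delimited file into (hash, count) tuples."""
    records = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                # Skip blank or malformed lines.
                continue
            try:
                records.append((int(row[0]), int(row[1])))
            except ValueError:
                # Assumption: a non-numeric row is a header line, so skip it.
                continue
    return records


# Stand-in for a file written by kct.dump(sortcounts=True, file="kct.dump").
sample = (
    "382727017318141683\t1\n"
    "73459868045630124\t2\n"
    "17832910516274425539\t2\n"
)
path = os.path.join(tempfile.mkdtemp(), "kct.dump")
with open(path, "w") as handle:
    handle.write(sample)

loaded = read_dump(path)
print(loaded)
```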

If no output file is specified, records are returned as a list of (hash, count) tuples (as above).

This list can be converted to a pandas DataFrame:

import pandas as pd
table_dump = kct.dump(sortcounts=True)
df = pd.DataFrame(table_dump, columns=['Hash', 'Count'])
print(df)
>>>
  '''
                     Hash  Count
  0    382727017318141683      1
  1     73459868045630124      2
  2  17832910516274425539      2
  '''
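
When pandas is not needed, the same list of tuples also works with built-in Python types. The sketch below uses the records from the example above and converts them to a dict for lookup by hash. The variable names are illustrative only.

```python
# Records as returned by kct.dump(sortcounts=True) in the example above.
table_dump = [
    (382727017318141683, 1),
    (73459868045630124, 2),
    (17832910516274425539, 2),
]

# Map each hash to its count for direct lookup.
counts = dict(table_dump)

# Count stored under the hash for 'GGGG'.
gggg_count = counts[73459868045630124]

# Total number of k-mers counted across all hashes.
total = sum(counts.values())

print(gggg_count, total)
```

An empty list converts to an empty dict, so the same code handles an empty table.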

If the table is empty, dump() returns an empty list:

empty_kct = oxli.KmerCountTable(ksize=4)

empty_kct.dump()
>>> []