go-kmv is an adaptive version of K-minimum values algorithm for cardinality estimation
This repository provides:
- A library for your own Go programs
- A cmd tool which estimates the cardinality reading from the stdin (so you can use it with the pipe
|
linux operator)
The formula used for estimating the cardinality is exactly the same described in the paper Counting distinct elements in a data stream. What makes this implementation interesting is the use of an adaptive table which grows in order to provide better estimations. The implementation of the adaptive-table can be found here
After compiling cmd/main.go
we can run the algorithm from our terminal
$ go build -o go-kmv main.go
# Output
# ${CardinalityEstimation} ${ProssecedElements} ${TableSize}
$ ./go-kmv < ../data/bible.txt
33938 824036 465
# If we (really) count them
$ tr ' ' '\n' < ../data/bible.txt | sort | uniq -c | wc -l
34040
If what you want is to use it as a dependency for your project
package main
import gokmv "github.com/positiveblue/go-kmv"
func main() {
// Get dataStream
dataStream := myDataStream()
// Create the estimator
initialSize := 64
estimator := gokmv.NewKMV(initialSize)
for element := range dataStream {
// element has to be a UInt64
estimator.InsertUint64(element)
}
estimator.Size() // returns the table size
estimator.ElementsAdded() // returns the total elements that we processed
estimator.EstimateCardinality() // returns the cardinality estimation
}
Because of the lack of generics in Go go-kmv only provides Insert
functions for Uint64
and strings
. If you want to use your own hash functions or add new types you can just create your own function:
// Insert my type to the table
// Using my hash function
func (kmv *KMV) InsertMyType(s string) {
// Remember to use the internal seed to have reproducible results
hash := myHashFunction.Sum64([]byte(s), kmv.Seed())
// The has has to return a Uint64
kmv.InsertUint64(hash)
}
Cardinalty Estimation is considered solved under all meanings. Nowadays computers have enough memory for computing the cardinality of small sets and for extream cases (big data)algorithms like HyperLogLog and KMV already give an accuracy of ~98% using a few bytes of memory.
In real life what people usually use is an implementation of HyperLogLog with a table size from about 128 to 4096. HyperLogLog and all the algorithms of its family can only use tables of size 2^k
where k is a positive integer. go-kmv does not have that limitation and automatically provides a good trade-off without knowing in advance the order of distinct elements that we have to estimate.
The current implementation grows with a factor of klog(n)
where k
is the inital table size and n
is the number of disctinct elements in the stream. That means that runing go-kmv with an initialSize
of 64 and processing and stream of 10^6 elements the final table size will be about ~600 and the accuracy of the estimation will be ~98.00%.
A critical part to achive meaningful results is to use a good hash function (where good = few colisions). Hash Functions like FNV, from the go stdlib are not good enough to ensure the theoretical results. Other algorithms like AES provide the best results but are slower and it seems a bit overkill for this implementation. Murmur3 provides the best ratio results/processing time and it has been used in this implementation.