Find the top 100 most frequent long values in a large (4 billion values) data file
A good starting point for reading the code is src/main/kotlin/org/xor/top100/Main.kt. The approach is as follows:
- Generate a data file.
- Sort and merge. Since the file does not fit in memory, we do an external sort: chunks of the data file are sorted individually (using `MappedByteBuffer`s) and then merged into a single fully sorted file.
- Extract the top 100 most frequent values. Because the data is sorted, occurrences of each distinct value are contiguous, so a single pass that counts run lengths yields the global frequency of every value.
Running with 8 gigabytes of memory (`-Xmx8G -Xms8G`) we get:
For a 7.5G data file:
- Generate the data file: 33 sec.
- Sort and merge: 160 sec.
- Top 100: 25 sec.
Total time: 220 sec.

For a 30G data file:
- Generate the data file: 140 sec.
- Sort and merge: ~25 min.
- Top 100: 100 sec.
Total time: ~30 min.