Find the top 100 most frequent long values in a large (4 billion values) data file
A good starting point for reading the code is src/main/kotlin/org/xor/top100/Main.kt. The approach is as follows:
- Generate a data file.
- Sort and merge. Since the file does not fit in memory, we do an external sort: chunks of the data file are sorted individually (using `MappedByteBuffer`s) and then merged into a single fully sorted file.
- Extract the top 100 most frequent values. Because the data is sorted, occurrences of each distinct value are contiguous, so a single pass that counts run lengths yields the global frequency of every value.
Running with 8 gigabytes of memory (`-Xmx8G -Xms8G`) we get:
For a 7.5G data file:
- Generate the data file: 33 sec.
- Sort and merge: 160 sec.
- Top 100: 25 sec.
Total time: 220 sec.

For a 30G data file:
- Generate the data file: 140 sec.
- Sort and merge: ~25 min.
- Top 100: 100 sec.
Total time: ~30 min.