1BRC in C (< 1.7 seconds) #46
Replies: 8 comments 24 replies
-
I just added in concurrency: fe6d42d. After distributing the work among 12 threads the runtime is down to well under 5 seconds on my laptop, from 30s for the single threaded solution. |
Beta Was this translation helpful? Give feedback.
-
In Simply find the newline character, then cast the preceding 8 chars into an int. You can now add up these ints, keeping count of how many you added. The ints are comparable (ie When you have the total, you can then parse the result. Quite a bit of thought needed to see how negatives impact the result, but I believe its possible. |
Beta Was this translation helpful? Give feedback.
-
What is the time of the fastest Java submission in your machine, please? |
Beta Was this translation helpful? Give feedback.
-
I try to run your code with my example generated data, but the result is different from reference. Your result: Manually looking:
It seems there's a problem with your min values. |
Beta Was this translation helpful? Give feedback.
-
thanks for the repository |
Beta Was this translation helpful? Give feedback.
-
I ran your code on Dual EPYC 9354 machines used here: #138
You should -150ms from Update: ran again, this time the PC is less busy
I used an older |
Beta Was this translation helpful? Give feedback.
-
Just a friendly suggestion: with hyperfine (https://github.com/sharkdp/hyperfine) the figures should be more reliable and comparable. |
Beta Was this translation helpful? Give feedback.
-
Hi, could you check your latest file with 10k unique input tests? I ran but it seg fault. Also I just noticed your code doesn't have any string comparison, and relies on hash value completely, right? In that case hash collision will give wrong results. |
Beta Was this translation helpful? Give feedback.
-
You nerd sniped me with this fun challenge! I wanted to see how fast it would be when implemented in standard C99.
Code: https://github.com/dannyvankooten/1brc
Some details: https://www.dannyvankooten.com/blog/2024/1brc/
It finishes in just under
30 seconds5 seconds3 seconds2.2 seconds1.6 seconds on my AMD Ryzen 4800U laptop CPU.With concurrency I'm hoping to get that down to well under 10 seconds.Current tricks over a naive chunked read (from most significant to micro):
mmap
entire datafile into memory. On consecutive runs, this means file is still in pagecache.-
and.
position and then unroll rest of parsing, while parsing as integer so we can do integer math during rest of program.;
separator while hashing in the same loop.b < a
requires a branch instruction):Notes
mmap
for files this large is apparently notoriously slow on MacOS and we're probably better off reading chunked on that platform, but since the official challenge is specifically targeting Linux, so will I.The performance difference between a warm and a hot pagecache is quite extreme. Run
echo 3 > /proc/sys/vm/drop_caches
to drop your pagecache, then run the program twice in a row. It's not uncommon for the second run to be well over twice as fast.Comparision vs current leading Java implementations
Since I don't have access to a Hetzner CCX33 box, here are the reference times for the currently leading Java implementations from the official challenge when I run them on my machine.
Runtime on 5995WX with 128 threads:
@lehuyduc was so kind to run this solution on a Threadripper 5995WX with 128 threads:
// 5995 WX 128 threads
Runtime inside main = 0.256s
munmap cost = 0.188s
free memory cost = 0.001s
real 0m0.449s
user 0m26.775s
sys 0m0.901s
#138 contains his C++ implementation with SIMD which runs even faster than this.
Progressions
You can find the average runtime (across 5 consecutive runs) for the various states of the program below, from baseline to the final and fully optimized version. Because I have no patience, this was run on a measurements file with only 100M rows.
Beta Was this translation helpful? Give feedback.
All reactions