1BRC in C (< 1.7 seconds) #46

dannyvankooten · 2024-01-03T17:34:04Z

dannyvankooten
Jan 3, 2024

You nerd sniped me with this fun challenge! I wanted to see how fast it would be when implemented in standard C99.

Code: https://github.com/dannyvankooten/1brc
Some details: https://www.dannyvankooten.com/blog/2024/1brc/

It finishes in just under ~~30 seconds~~ ~~5 seconds~~ ~~3 seconds~~ ~~2.2 seconds~~ 1.6 seconds on my AMD Ryzen 4800U laptop CPU. ~~With concurrency I'm hoping to get that down to well under 10 seconds.~~

Current tricks over a naive chunked read (from most significant to micro):

* Parallel processing in completely separate chunks using all available logical cores. The aggregated results from each thread are in turn aggregated once all threads have finished.
* City names are hashed using a simple (but fast) multiplication hash with a hashtable that has a load factor of well under 0.2.
* `mmap` entire datafile into memory. On consecutive runs, this means file is still in pagecache.
* Custom float parsing: check for `-` and `.` position and then unroll rest of parsing, while parsing as integer so we can do integer math during rest of program.
* Custom parser for city name that finds `;` separator while hashing in the same loop.
* Branchless min/max (except on some machines where `b < a` requires a branch instruction):

int min(int a, int b) {
    return a ^ ((b ^ a) & -(b < a));
}
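To make the parsing tricks above concrete, here is a minimal sketch of a record parser that hashes the city name while scanning for `;` and then reads the temperature as an integer number of tenths. The names `struct record` and `parse_record` and the multiplier 31 are assumptions made for illustration, not taken from the repository, and the sketch assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of parsing one "name;temperature\n" record (hypothetical type). */
struct record {
    const char *name;   /* points into the input buffer, not NUL-terminated */
    size_t name_len;
    uint32_t hash;      /* multiplicative hash of the name */
    int temp_tenths;    /* temperature times 10, e.g. "-12.3" -> -123 */
};

/* Parses one record starting at p and returns a pointer just past its '\n'.
 * Assumes 1-2 integer digits and exactly 1 decimal digit. */
static const char *parse_record(const char *p, struct record *out) {
    const char *start = p;
    uint32_t h = 0;

    /* Find the ';' separator and hash the name in the same loop.
     * The multiplier 31 is an arbitrary choice for this sketch. */
    while (*p != ';') {
        h = h * 31u + (unsigned char)*p;
        p++;
    }
    out->name = start;
    out->name_len = (size_t)(p - start);
    out->hash = h;
    p++; /* skip ';' */

    int sign = 1;
    if (*p == '-') {
        sign = -1;
        p++;
    }

    int value;
    if (p[1] == '.') { /* "d.d\n" */
        value = (p[0] - '0') * 10 + (p[2] - '0');
        p += 4;
    } else {           /* "dd.d\n" */
        value = (p[0] - '0') * 100 + (p[1] - '0') * 10 + (p[3] - '0');
        p += 5;
    }
    out->temp_tenths = sign * value;
    return p;
}
```

Keeping the value as integer tenths means min, max and sum need no floating point until the final division for the mean.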

Notes

mmap for files this large is apparently notoriously slow on MacOS and we're probably better off reading chunked on that platform, but since the official challenge is specifically targeting Linux, so will I.
The performance difference between a cold and a warm pagecache is quite extreme. Run `echo 3 > /proc/sys/vm/drop_caches` (as root) to drop your pagecache, then run the program twice in a row. It's not uncommon for the second run to be well over twice as fast.
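For readers unfamiliar with the mmap approach, a minimal sketch of mapping a whole file read-only on Linux could look like this. The helper name `map_file` is hypothetical and the repository's code may differ in flags and error handling.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file read-only. Returns NULL on failure (including an
 * empty file); on success stores the file size in *size. The mapping
 * stays valid after the file descriptor is closed. */
static const char *map_file(const char *path, size_t *size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return NULL;

    struct stat sb;
    if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *data = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) return NULL;

    *size = (size_t)sb.st_size;
    return data;
}
```

The caller releases the mapping with `munmap((void *)data, size)`; the timings further down the thread suggest that this call can take a noticeable share of the total runtime.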

Comparison vs current leading Java implementations

Since I don't have access to a Hetzner CCX33 box, here are the reference times for the currently leading Java implementations from the official challenge when I run them on my machine.

| # | Result (m:s.ms) | Implementation | Language | Submitter |
|---|---|---|---|---|
| ? | 00:01.590 | link | C | Danny van Kooten |
| 1. | 00:06.131 | link | 21.0.1-graalce | Sam Pullara |
| 2. | 00:06.421 | link | 21.0.1-graalce | Roy van Rijn |

Runtime on 5995WX with 128 threads:

@lehuyduc was so kind to run this solution on a Threadripper 5995WX with 128 threads:

// 5995 WX 128 threads
Runtime inside main = 0.256s
munmap cost = 0.188s
free memory cost = 0.001s

real 0m0.449s
user 0m26.775s
sys 0m0.901s

#138 contains his C++ implementation with SIMD which runs even faster than this.

Progressions

You can find the average runtime (across 5 consecutive runs) for the various states of the program below, from baseline to the final and fully optimized version. Because I have no patience, this was run on a measurements file with only 100M rows.

1.c runtime=[ 55.86 59.09 64.28 63.63 56.08 ] average=59.79s   linear-search by city name (baseline)
2.c runtime=[ 9.14 9.31 9.35 9.05 9.30 ] average=9.23s hashmap with linear probing
3.c runtime=[ 4.27 4.51 4.47 4.28 4.25 ] average=4.36s custom temperature float parser instead of strtod
4.c runtime=[ 2.38 2.41 2.46 2.40 2.39 ] average=2.41s fread with 64MB chunks instead of line-by-line
5.c runtime=[ 2.13 1.99 1.99 2.00 2.05 ] average=2.03s unroll parsing of city name and generating hash
6.c runtime=[ 0.49 0.49 0.49 0.50 0.50 ] average=0.49s parallelize across 16 threads
7.c runtime=[ 0.30 0.25 0.23 0.24 0.24 ] average=0.25s mmap entire file instead of fread in chunks

dannyvankooten · 2024-01-04T09:27:10Z

dannyvankooten
Jan 4, 2024
Author

I just added in concurrency: fe6d42d. After distributing the work among 12 threads the runtime is down to well under 5 seconds on my laptop, from 30s for the single threaded solution.

10 replies

c-blake Jan 5, 2024

You should not need MAP_ANONYMOUS to avoid a -1/errno (an strace for me on Linux 6.6):

openat(AT_FDCWD, "Chicago", O_RDONLY|O_NONBLOCK) = 3
newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=9685228, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 9685228, PROT_READ, MAP_SHARED|21<<MAP_HUGE_SHIFT, 3, 0) = 0x14adad4bd000
close(3)

However, an experiment I just did in light of your trouble suggests to me that the kernel may only do a best effort here, giving you regular pages if it cannot find enough huge. { Something else misled me to think just including it in the flag mask was "enough", though I vaguely recall discovering it was not enough like 6 years ago.. I had hoped it had gotten easier. Sorry. }

There may be an easier way, but setting up a hugetlbfs file system is a reliable way I know to be sure you get huge pages backing a file mmap. So, I guess you should do that after all. Sorry I suggested it was a skippable step. grep hugetlbfs /proc/filesystems should tell you if you need to recompile your Linux kernel (or elsewise get one that supports it). And you may also profitably check the kernel config for all the CONFIG_HUGE* settings (e.g. /proc/config.gz). I'll add a note to that HOW2 block about checking kernel support.

Once you are mmap'ing files opened in hugetlbfs, you needn't even pass the MAP_HUGE* to get huge pages from mmap, BUT files in this filesystem are all rounded up to the next 2MB. So, you need to preserve/propagate file size somehow to know how much data to look at. { That same property made me write my own cpHuge - also mentioned in the doc file notes, along with a note about difficulty of use, but maybe I just don't know the easier way - It sure would be easier to use if you could fstat() a regular file for its true size and then mmap(HUGE) without any direct hugetlbfs involvement, even if there were still a requirement to have a mounted hugetlbfs with enough free space. }

As motivation for all of this, you may see >1.3x speed-up because there will be a few hundred thousand fewer switches and simple "user,sys"-time accounting cannot account for all the disruption of the CPU happy paths from the page faults. The only way to really know is to measure the improvement. { Another idea vaguely along these lines is more recent Apple Silicon which has 16K pages (only 4x bigger not 512x), (& maybe with Asahi Linux to not deal with who-knows what Apple does in their page fault handler). }

sharpobject Jan 5, 2024

I think with MAP_HUGETLB and without MAP_POPULATE, you can get a bus error from the page fault handler if there aren't enough physical huge pages to serve the request.

For the most part I'll be ignoring MAP_HUGETLB because @gunnarmorling says he will not reserve huge pages for us.

It's definitely correct that any solutions that use mmap and approach optimal codegen for the parser will spend a lot of their time on page faults. That's unfortunate. I expect mmap to still be better than other options because of high syscall overhead assuming mitigations are on and because it avoids copies compared to syscalls.

c-blake Jan 5, 2024

FWIW, I suspect, IF the file resides in a hugetlbfs, the pages are guaranteed no matter the over-commit settings and that residence seems to be needed (unless you know an easier way?). I agree "going huge" remains overly difficult (as I said a few ways) which is sad.

For me, 4k pages were making single core 2.6x slower. Also FWIW, MAP_POPULATE seems to slightly slow the 4K page case (451.02+-0.78ms vs 439.93+-0.75ms, w/huge being 169.41+-0.53ms). I could imagine that delta being kernel vsn-dependent (& I'd naively expect populating all at once to be faster than populating on-demand.). At 10 sigma with 32 repeats of each case, it seems like a pretty real, if small, effect. { Also recall, my binary data is much smaller at only 4GB. So, that 270ms 4K-2M delta needs upscaling in the parsing case (eg. 810ms for 12G), and times on other machines may vary a bit (and, IRL all kinds of complex branch predictor / cache / etc. things are also in the mix). }

Anyway, it's fine for you (or anyone!) to ignore it & focus on Challenge Rules. I actually liked the fixed deployment aspect. I mostly mentioned it here since it was a big effect, Danny's running on his laptop not Gunnar's cloud instances, is using mmap already, and this specific aspect potentially applies to many more IO situations than just this challenge. So, learning to deal with huge pages may pay dividends for him in other endeavors beyond just a nerd snipe.

sharpobject Jan 5, 2024

For sure. Using huge pages tends to be a 10-20% win for normal software and it's much easier than writing string parsers using simd intrinsics.

synap5e Jan 6, 2024

Tiny edge case, but for super small measurement cardinalities it's possible for the final chunk's start index to fall on the last line resulting in that line being ignored.
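One way to avoid this kind of edge case is to compute every chunk boundary from the data and let empty chunks exist. The sketch below is an assumption about how that could be done, not the repository's code; `split_chunks` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills bounds[0..n_chunks] so that chunk i covers [bounds[i], bounds[i+1]).
 * Every boundary sits right after a '\n' (or at 0 / size), so no line is
 * split or skipped. Chunks may be empty when the file has few lines. */
static void split_chunks(const char *data, size_t size, size_t n_chunks,
                         size_t *bounds) {
    assert(n_chunks > 0);
    bounds[0] = 0;
    for (size_t i = 1; i < n_chunks; i++) {
        /* Tentative split point, never before the previous boundary. */
        size_t pos = size / n_chunks * i;
        if (pos < bounds[i - 1]) pos = bounds[i - 1];

        /* Advance to just past the next newline, or to the end of data. */
        const char *nl = memchr(data + pos, '\n', size - pos);
        bounds[i] = nl ? (size_t)(nl - data) + 1 : size;
    }
    bounds[n_chunks] = size;
}
```

Each worker then processes exactly the lines in its half-open range, and a worker whose range is empty simply returns an empty result.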

Hello1024 · 2024-01-04T11:35:03Z

Hello1024
Jan 4, 2024

In parse_number, you don't actually need to do all of the parsing...

Simply find the newline character, then cast the preceding 8 chars into an int.

You can now add up these ints, keeping count of how many you added. The ints are comparable (ie (int)"York;21.2" > (int)"York;20.2")

When you have the total, you can then parse the result. Quite a bit of thought is needed to see how negatives impact the result, but I believe it's possible.

3 replies

maver1ck Jan 4, 2024

What about negative numbers ?

sharpobject Jan 4, 2024

Even if all the numbers are positive this won't work. Carries from the tenths digit end up getting into the ones digit and you can't tell how many times that happened to fix it.

(This is assuming you bswapped the integers before adding them, if you didn't then you lose immediately because your carries are going the wrong way)

maver1ck Jan 5, 2024

Yep. ';' > '0-9'

tivrfoa · 2024-01-05T11:26:23Z

tivrfoa
Jan 5, 2024

You nerd sniped me with this fun challenge! I wanted to see how fast it would be when implemented in standard C99.

Code: https://github.com/dannyvankooten/1brc

It finishes in just under ~~30 seconds~~ ~~5~~ 3 seconds on my AMD Ryzen 4800U laptop CPU. ~~With concurrency I'm hoping to get that down to well under 10 seconds.~~

Current tricks over a naive chunked read:
* Entire file is `mmap`'d into memory.

* Because all floating points are rounded to 1 decimal we can multiply by 10 and use integer math from then on.

* FNV1-a hash on entire city name with linear probing to handle collisions and a load factor of well under 0.5.

* Minimal memory copies.

* Process in 12 concurrent separate chunks, then aggregate once all threads have finished.

What is the time of the fastest Java submission on your machine, please?

1 reply

dannyvankooten Jan 5, 2024
Author

Had to set up a Java environment for this, but https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java on 21.0.1-graalce finishes in about 5.98s for me. Faster than I expected it to be, to be honest (given that it's Java). Impressive stuff!

real    0m5.981s
user    0m49.607s
sys     0m14.849s

lehuyduc · 2024-01-05T14:44:22Z

lehuyduc
Jan 5, 2024

I tried to run your code with my example generated data, but the result is different from the reference.
Data file: https://drive.google.com/file/d/1HEyNw4M453n0tnuaAm9nwaCiLydQYnpo/view?usp=sharing

Your result: {Abha=-22.5/18.0/69.9, Abidjan=-30.0/26.0/78.1, Abéché=-16.4/29.4/81.0, Accra=-18.6/26.4/75.5, Addis Ababa=-29.9/16.1/64.7, Adelaide=-29.2/17.3/68.9, Aden=-19.9/29.1/79.2
Reference: {Abha=-37.5/18.0/69.9, Abidjan=-30.0/26.0/78.1, Abéché=-23.6/29.4/81.0, Accra=-23.1/26.4/75.5, Addis Ababa=-32.0/16.0/64.7, Adelaide=-32.4/17.3/68.9, Aden=-20.3/29.1/79.2

Manually looking:

>cat measurements.txt | grep "Abha;-37.5"
Abha;-37.5
>cat measurements.txt | grep "Accra;-23.1"
Accra;-23.1
>cat measurements.txt | grep "Addis Ababa;-32.0"
Addis Ababa;-32.0

It seems there's a problem with your min values.
Could you test again with this input? Thanks!

9 replies

lehuyduc Jan 6, 2024

struct Group {
  unsigned int count;
  int sum;
  int min;
  int max;
  char *label;
};

I found a small bug. sum should be int64_t here, else there might be int overflow. The code will be ~2-3% slower.
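To see why 32 bits can run out: temperatures are stored as tenths, so one value can be as large as 999 in magnitude, and `INT32_MAX / 999` is only about 2.15 million rows. A hedged sketch of the widened accumulator follows; the names `struct group` and `group_add` are illustrative and the `label` field is omitted.

```c
#include <stdint.h>

/* Per-station aggregate; temperatures are stored as integer tenths. */
struct group {
    unsigned int count;
    int64_t sum; /* 64-bit: a few million rows near 99.9 overflow 32 bits */
    int min;
    int max;
};

/* Folds one measurement (in tenths) into the aggregate. */
static void group_add(struct group *g, int temp_tenths) {
    if (g->count == 0 || temp_tenths < g->min) g->min = temp_tenths;
    if (g->count == 0 || temp_tenths > g->max) g->max = temp_tenths;
    g->sum += temp_tenths;
    g->count++;
}
```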

AlexanderYastrebov Jan 6, 2024

I am not sure if C compiler is smart enough to figure this out, maybe you can try changing parse_number like so

if data[1] == '.' {
  // 1.2\n
  temp = int64(data[0])*10 + int64(data[2]) - '0'*(10+1)
  data = data[4:]
} else {
  // 12.3\n
  temp = int64(data[0])*100 + int64(data[1])*10 + int64(data[3]) - '0'*(100+10+1)
  data = data[5:]
}

i.e. convert from ascii the final result instead of each digit.
I came up with this myself but then also saw this is used in @royvanrijn java impl #141:

byte dot;
if ((dot = (bb.get(delimiterPointer + 1))) == '.') {
    measuredValue = neg * ((bb.get(delimiterPointer)) * 10 + (bb.get(delimiterPointer + 2)) - 528);
    endPointer = delimiterPointer + 3;
}
else {
    measuredValue = neg * (bb.get(delimiterPointer) * 100 + dot * 10 + bb.get(delimiterPointer + 3) - 5328);
    endPointer = delimiterPointer + 4;
}
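For comparison, a C rendering of the same idea might look like the sketch below. It is an illustration under the assumption of well-formed input, not the code that ended up in the repository; `parse_tenths` is a hypothetical name. The constants match the Java version: `'0' * 11` is 528 and `'0' * 111` is 5328.

```c
/* Parses "d.d\n" or "dd.d\n", optionally preceded by '-', into tenths and
 * returns a pointer just past the '\n'. The ASCII bias of all digits is
 * removed with a single subtraction instead of one per digit. */
static const char *parse_tenths(const char *p, int *out) {
    int sign = 1;
    if (*p == '-') {
        sign = -1;
        p++;
    }
    if (p[1] == '.') {
        /* d.d: weights 10 and 1, so the bias is '0' * 11 == 528 */
        *out = sign * (p[0] * 10 + p[2] - '0' * (10 + 1));
        return p + 4;
    }
    /* dd.d: weights 100, 10 and 1, so the bias is '0' * 111 == 5328 */
    *out = sign * (p[0] * 100 + p[1] * 10 + p[3] - '0' * (100 + 10 + 1));
    return p + 5;
}
```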

dannyvankooten Jan 6, 2024
Author

@AlexanderYastrebov Love the idea! I just implemented it but just looking at the runtime, I guess the compiler (gcc14) already took care of this. Sadly I don't know enough ASM to confirm.

AlexanderYastrebov Jan 6, 2024

I don't think it can optimize without knowing the data due to possible over/underflows.

Here https://godbolt.org/z/x5neaoEaf you may see (maximize diff view) new version uses fewer instructions that subtract '0': sub eax, 48 and lea edx, [rax-48]

Most likely it's not a game changer but still nice to have.

AlexanderYastrebov Jan 8, 2024

I also saw a branchless version of parseNumber written by @merykitty

1brc/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java

Lines 151 to 184 in 23913da

    
    // Parse a number that may/may not contain a minus sign followed by a decimal with
    // 1 - 2 digits to the left and 1 digits to the right of the separator to a
    // fix-precision format. It returns the offset of the next line (presumably followed
    // the final digit and a '\n')
    private static long parseDataPoint(Aggregator aggr, MemorySegment data, long offset) {
        long word = data.get(JAVA_LONG_LT, offset);
        // The 4th binary digit of the ascii of a digit is 1 while
        // that of the '.' is 0. This finds the decimal separator
        // The value can be 12, 20, 28
        int decimalSepPos = Long.numberOfTrailingZeros(~word & 0x10101000);
        int shift = 28 - decimalSepPos;
        // signed is -1 if negative, 0 otherwise
        long signed = (~word << 59) >> 63;
        long designMask = ~(signed & 0xFF);
        // Align the number to a specific position and transform the ascii code
        // to actual digit value in each byte
        long digits = ((word & designMask) << shift) & 0x0F000F0F00L;
        // Now digits is in the form 0xUU00TTHH00 (UU: units digit, TT: tens digit, HH: hundreds digit)
        // 0xUU00TTHH00 * (100 * 0x1000000 + 10 * 0x10000 + 1) =
        // 0x000000UU00TTHH00 +
        // 0x00UU00TTHH000000 * 10 +
        // 0xUU00TTHH00000000 * 100
        // Now TT * 100 has 2 trailing zeroes and HH * 100 + TT * 10 + UU < 0x400
        // This results in our value lies in the bit 32 to 41 of this product
        // That was close :)
        long absValue = ((digits * 0x640a0001) >>> 32) & 0x3FF;
        long value = (absValue ^ signed) - signed;
        aggr.min = Math.min(value, aggr.min);
        aggr.max = Math.max(value, aggr.max);
        aggr.sum += value;
        aggr.count++;
        return offset + (decimalSepPos >>> 3) + 3;
    }

and wrote my own version to better understand it
AlexanderYastrebov@08e3037

It did not give any boost for my implementation which is overwhelmed by hashmap access.

Maybe you decide to try it out. Since it reads 8 bytes at once, you'd need to allow each segment of the file to overlap the beginning of the next segment by several bytes and process the last entry separately.
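A rough C port of that branchless parser is sketched below for anyone who wants to experiment. It assumes a little-endian CPU, the GCC/Clang builtin `__builtin_ctzll`, and at least 8 readable bytes at `p`; the function name `parse_tenths_swar` is made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Parses "[-]d.d\n" or "[-]dd.d\n" at p into tenths without branching on
 * the data. Stores a pointer just past the '\n' in *next. */
static int parse_tenths_swar(const char *p, const char **next) {
    uint64_t word;
    memcpy(&word, p, sizeof word); /* unaligned-safe 8-byte load */

    /* Bit 4 of an ASCII digit is 1, of '.' it is 0. Checking bytes 1..3
     * locates the decimal separator: the result is 12, 20 or 28. */
    int dec_pos = __builtin_ctzll(~word & 0x10101000ULL);
    int shift = 28 - dec_pos;

    /* All ones if the first byte is '-' (its bit 4 is 0), else zero. */
    uint64_t sign = 0 - ((~word >> 4) & 1);
    uint64_t sign_mask = ~(sign & 0xFF); /* clears the '-' byte */

    /* Align so that '.' sits in byte 3 and keep only the digit nibbles. */
    uint64_t digits = ((word & sign_mask) << shift) & 0x0F000F0F00ULL;

    /* One multiply combines hundreds*100 + tens*10 + units at bits 32..41. */
    uint64_t abs_value = ((digits * 0x640A0001ULL) >> 32) & 0x3FF;

    *next = p + (dec_pos >> 3) + 3;
    /* Conditional negate: (x ^ -1) - (-1) == -x in two's complement. */
    return (int)(int64_t)((abs_value ^ sign) - sign);
}
```

The 8-byte load is the reason for the overlap requirement mentioned above: near the end of a mapping, the load could otherwise touch bytes past the last record.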

galetska228 · 2024-01-06T08:46:03Z

galetska228
Jan 6, 2024

thanks for the repository

0 replies

lehuyduc · 2024-01-08T09:52:42Z

lehuyduc
Jan 8, 2024

I ran your code on Dual EPYC 9354 machines used here: #138

Runtime inside main = 350.263620ms
munmap cost = 365.615137ms
free memory cost = 20.820573ms
real    0m0.739s
user    0m29.500s
sys     0m2.256s

You should subtract 150ms from real and munmap cost, because that's the difference between weekend and workday performance on this PC.

Update: ran again, this time the PC is less busy

Runtime inside main = 343.277755ms
munmap cost = 216.992869ms
free memory cost = 19.699456ms
real    0m0.582s
user    0m29.590s
sys     0m2.204s

I used an older analyze.c on your repo to test. Will update later dannyvankooten/1brc#2

0 replies

demming · 2024-01-09T21:04:19Z

demming
Jan 9, 2024

Just a friendly suggestion: with hyperfine (https://github.com/sharkdp/hyperfine) the figures should be more reliable and comparable.

1 reply

AlexanderYastrebov Jan 9, 2024

1brc moved to hyperfine, see #182

lehuyduc · 2024-01-13T14:16:38Z

lehuyduc
Jan 13, 2024

Hi, could you check your latest file with the 10k unique keys test input? I ran it, but it segfaults.

Also, I just noticed your code doesn't have any string comparison and relies on the hash value completely, right? In that case a hash collision will give wrong results.

https://github.com/gunnarmorling/1brc/blob/main/src/test/resources/samples/measurements-10000-unique-keys.txt
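A minimal sketch of a collision-safe lookup, assuming linear probing over a power-of-two table: the probe only stops at a slot whose stored key actually matches. `TABLE_SIZE`, `struct entry` and `find_slot` are hypothetical names, and the table is deliberately tiny.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 16u /* power of two; tiny on purpose for the example */

struct entry {
    const char *key; /* NULL marks an empty slot */
    size_t key_len;
    int count;
};

/* Returns the slot for key, claiming an empty one on first sight.
 * Probing continues past slots that hold a different key, so two names
 * with the same hash never share an entry. Assumes the table never fills. */
static struct entry *find_slot(struct entry *table, const char *key,
                               size_t len, uint32_t hash) {
    uint32_t i = hash & (TABLE_SIZE - 1);
    while (table[i].key != NULL &&
           (table[i].key_len != len ||
            memcmp(table[i].key, key, len) != 0)) {
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (table[i].key == NULL) {
        table[i].key = key;
        table[i].key_len = len;
    }
    return &table[i];
}
```

The extra `memcmp` costs something on every lookup, so the effect on runtime would have to be measured.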

0 replies
