Add TLB aliasing example #7
base: master
Conversation
Thank you, this is awesome! 🎈 Would you mind if I refactored it a little to bring it closer to the other examples? I would also probably remove the … Also, a hint could be added to the README that the …

I suspect that with large strides some performance will be lost, because the hardware prefetcher will not be able to prefetch such large strides. Here the L1 cache miss count is twice as big with a larger stride and the same count, and the runtime is twice as slow. This may overshadow the cost of the TLB misses.

$ perf stat -edTLB-load-misses,dTLB-store-misses,L1-dcache-load-misses tlb-alias 2048 1
2308
1086.64 misses per repetition (217327925 total)
271 524 493 dTLB-load-misses (79,92%)
138 841 752 dTLB-store-misses (79,96%)
568 777 142 L1-dcache-load-misses (57,20%)
$ perf stat -edTLB-load-misses,dTLB-store-misses,L1-dcache-load-misses tlb-alias 2048 16
4661
1206.49 misses per repetition (241297065 total)
300 500 674 dTLB-load-misses (80,00%)
108 117 375 dTLB-store-misses (79,98%)
903 033 508 L1-dcache-load-misses (57,12%)
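For context, a minimal sketch of the kind of strided page-touching loop being measured here; the argument meanings, constants, and access pattern are assumptions for illustration, not the example's actual code:

#include <sys/mman.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    // Assumed meaning of the two CLI arguments used above: number of pages to
    // touch per repetition, and the stride between them (in pages).
    size_t count = argc > 1 ? strtoul(argv[1], nullptr, 10) : 2048;
    size_t stride = argc > 2 ? strtoul(argv[2], nullptr, 10) : 1;
    const size_t page = 4096;

    size_t length = count * stride * page;
    auto* mem = static_cast<uint8_t*>(mmap(nullptr, length, PROT_READ | PROT_WRITE,
                                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    // Each repetition touches one byte of every stride-th page, so it needs
    // `count` distinct TLB entries; large strides also defeat cache prefetching.
    uint64_t sum = 0;
    for (int rep = 0; rep < 100000; rep++)
        for (size_t i = 0; i < count; i++)
            sum += mem[i * stride * page]++;

    printf("%llu\n", (unsigned long long)sum);  // keep the accesses live
    munmap(mem, length);
    return 0;
}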
Refactoring for consistency is fine with me. I wasn't aware of the …
I agree, though this won't work directly on Windows, so it might as well link to something like this: https://stackoverflow.com/a/4823889/1107768.
I put the modified version here: https://github.com/Kobzol/hardware-effects/tree/tlb-aliasing. With …
I tried your branch vs. mine, and with …
With the … I can't reproduce your numbers for the L2 TLB. According to WikiChip and cpuid, my CPU (Kaby Lake) should have a 12-way associative shared L2 TLB with 1536 entries, so there should be 128 sets and increments of 128 should alias. But running … Either I've made a mistake somewhere, or the TLB is using a different strategy for indexing into the cache (hashing?). Do you still get massive increases in TLB misses when going with a count over 8 with the …?

I disabled (transparent) hugepages; that's the only thing I know of that could influence it.
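A minimal sketch of the plain modulo set indexing assumed in the comment above (1536 entries / 12 ways = 128 sets); the constants and the tlb_set helper are illustrative, not taken from the example's code:

#include <cstdint>
#include <cstdio>

// Assumed Kaby Lake shared L2 TLB geometry from the discussion above.
constexpr uint64_t PAGE_SIZE = 4096;
constexpr uint64_t TLB_ENTRIES = 1536;
constexpr uint64_t TLB_WAYS = 12;
constexpr uint64_t TLB_SETS = TLB_ENTRIES / TLB_WAYS;  // 128

// With plain modulo indexing, the set is the low bits of the virtual page number.
uint64_t tlb_set(uint64_t address) {
    return (address / PAGE_SIZE) % TLB_SETS;
}

int main() {
    // Pages spaced 128 pages (512 KiB) apart all land in set 0, so touching
    // more than 12 of them should start evicting each other's entries.
    for (uint64_t i = 0; i < 16; i++) {
        uint64_t addr = i * 128 * PAGE_SIZE;
        printf("page %2llu -> set %llu\n", (unsigned long long)i,
               (unsigned long long)tlb_set(addr));
    }
    return 0;
}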
The results I get comparing the two branches mostly differ in cache miss counts, but otherwise, yes, I still get a jump in L2 TLB misses going from … How do these compare on your CPU: …
This stuff is weird, I'm getting totally different results with your code and mine. I tracked it down to the method of allocation:

void* mem =
    mmap((void*)Tebibytes(2ul), block_size * count, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

I would expect …

With …

With …
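For reference, a minimal sketch of the two allocation variants being compared (shared vs. private anonymous mapping at the same fixed 2 TiB address); the Tebibytes helper mirrors the snippet above, the rest is illustrative:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the Tebibytes helper used in the snippet above.
constexpr uint64_t Tebibytes(uint64_t n) { return n << 40; }

// Map `size` bytes of anonymous memory at a fixed 2 TiB virtual address,
// either as a shared or as a private mapping, so the two can be benchmarked
// against each other.
void* map_block(size_t size, bool shared) {
    int flags = MAP_ANONYMOUS | MAP_FIXED | (shared ? MAP_SHARED : MAP_PRIVATE);
    void* mem = mmap((void*)Tebibytes(2ul), size, PROT_READ | PROT_WRITE,
                     flags, -1, 0);
    return mem == MAP_FAILED ? nullptr : mem;
}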
That's definitely even weirder. I didn't have a reason for choosing …
I changed the program to receive the …

$ perf stat -edTLB-load-misses,dTLB-store-misses tlb-aliasing 1536 1 1
14.91 misses per repetition (1490712 total)
159706 us
1 493 566 dTLB-load-misses
3 380 116 dTLB-store-misses
$ perf stat -edTLB-load-misses,dTLB-store-misses tlb-aliasing 1536 1 2
0.00 misses per repetition (16 total)
213354 us
522 dTLB-load-misses
92 dTLB-store-misses

Even if I add …
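Judging from the extra third command-line argument in the two runs above, a plausible sketch of such a change follows; the argument name and the 1 = MAP_SHARED / 2 = MAP_PRIVATE convention are assumptions, not taken from the actual patch:

#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 4) {
        fprintf(stderr, "usage: %s <count> <offset> <map-mode>\n", argv[0]);
        return 1;
    }
    size_t count = strtoul(argv[1], nullptr, 10);
    size_t offset = strtoul(argv[2], nullptr, 10);
    int map_mode = atoi(argv[3]);           // assumed: 1 = MAP_SHARED, 2 = MAP_PRIVATE

    const size_t page = 4096;
    size_t length = count * offset * page;  // room for `count` pages spaced `offset` pages apart
    int flags = MAP_ANONYMOUS | (map_mode == 1 ? MAP_SHARED : MAP_PRIVATE);

    void* mem = mmap(nullptr, length, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    // ... the strided benchmark over `mem` would run here ...

    munmap(mem, length);
    return 0;
}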
Is your branch behaving like mine with …?
It's private, yet I got almost the same results as your version with MAP_SHARED. I checked out a clean copy of your branch and tested it today with both MAP_PRIVATE and MAP_SHARED, and now it behaves reasonably. I don't remember whether it was doing this on my local or work PC, but let's not dwell on it; I probably made a mistake somewhere along the way. Sorry for that.

I tested multiple configurations and got these results:

19.20 misses per repetition (1919587 total)
208376 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 1536 1':
309 409 945 dTLB-loads (65,60%)
2 157 857 dTLB-load-misses # 0,70% of all dTLB cache hits (65,59%)
2 465 238 dTLB-misses # 0,80% of all dTLB cache hits (66,89%)
3 660 753 dTLB-store-misses (68,55%)
4 393 181 dtlb_store_misses.miss_causes_a_walk (67,76%)
0,209603618 seconds time elapsed
1021.20 misses per repetition (102120216 total)
1331882 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 2304 1':
463 318 338 dTLB-loads (66,38%)
153 831 597 dTLB-load-misses # 33,20% of all dTLB cache hits (66,59%)
153 920 401 dTLB-misses # 33,22% of all dTLB cache hits (66,90%)
77 072 478 dTLB-store-misses (66,99%)
92 099 645 dtlb_store_misses.miss_causes_a_walk (66,72%)
1,333054803 seconds time elapsed
1371.14 misses per repetition (137113934 total)
1886300 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 3072 1':
616 466 708 dTLB-loads (66,51%)
207 158 225 dTLB-load-misses # 33,60% of all dTLB cache hits (66,52%)
207 395 001 dTLB-misses # 33,64% of all dTLB cache hits (66,74%)
101 289 174 dTLB-store-misses (66,95%)
122 304 010 dtlb_store_misses.miss_causes_a_walk (66,75%)
1,887647856 seconds time elapsed
0.00 misses per repetition (0 total)
862 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 128':
2 607 518 dTLB-loads
722 dTLB-load-misses # 0,03% of all dTLB cache hits
722 dTLB-misses # 0,03% of all dTLB cache hits
244 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,001625369 seconds time elapsed
0.00 misses per repetition (0 total)
1116 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 16 128':
3 415 369 dTLB-loads
679 dTLB-load-misses # 0,02% of all dTLB cache hits
679 dTLB-misses # 0,02% of all dTLB cache hits
250 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,001815701 seconds time elapsed
0.00 misses per repetition (0 total)
2137 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 128':
5 027 968 dTLB-loads
944 dTLB-load-misses # 0,02% of all dTLB cache hits
944 dTLB-misses # 0,02% of all dTLB cache hits
314 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,002928527 seconds time elapsed
0.00 misses per repetition (0 total)
836 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 32':
2 607 048 dTLB-loads
712 dTLB-load-misses # 0,03% of all dTLB cache hits
712 dTLB-misses # 0,03% of all dTLB cache hits
251 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,001495578 seconds time elapsed
0.00 misses per repetition (0 total)
1605 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 32':
5 016 021 dTLB-loads
811 dTLB-load-misses # 0,02% of all dTLB cache hits
811 dTLB-misses # 0,02% of all dTLB cache hits
292 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,002236586 seconds time elapsed
0.00 misses per repetition (0 total)
810 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 12 64':
2 611 794 dTLB-loads
725 dTLB-load-misses # 0,03% of all dTLB cache hits
725 dTLB-misses # 0,03% of all dTLB cache hits
252 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,001497991 seconds time elapsed
0.00 misses per repetition (0 total)
1657 us
Performance counter stats for '../cmake-build-release/tlb-aliasing/tlb-aliasing 24 64':
5 017 697 dTLB-loads
697 dTLB-load-misses # 0,01% of all dTLB cache hits
697 dTLB-misses # 0,01% of all dTLB cache hits
293 dTLB-store-misses
<not counted> dtlb_store_misses.miss_causes_a_walk (0,00%)
0,002352180 seconds time elapsed

Offsets 32/64/128 don't increase the TLB misses when going over the associativity size. Maybe the Skylake TLB prefetcher got better and can avoid the misses when it recognizes a certain pattern?
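As a quick sanity check of which (count, offset) pairs should overflow a set under the simple 128-set / 12-way model discussed earlier (the model itself is the assumption here, not the CPU's documented behaviour):

#include <cstdint>
#include <cstdio>
#include <numeric>

constexpr uint64_t TLB_SETS = 128;  // assumed: 1536 entries / 12 ways
constexpr uint64_t TLB_WAYS = 12;

// Pages i*offset (i = 0..count-1) map to sets (i*offset) % TLB_SETS, which
// cycle through TLB_SETS / gcd(offset, TLB_SETS) distinct sets, so each
// touched set receives roughly count / (number of distinct sets) pages.
uint64_t pages_per_set(uint64_t count, uint64_t offset) {
    uint64_t distinct_sets = TLB_SETS / std::gcd(offset, TLB_SETS);
    return (count + distinct_sets - 1) / distinct_sets;
}

int main() {
    const uint64_t configs[][2] = {{12, 128}, {16, 128}, {24, 128},
                                   {12, 32},  {24, 32},  {12, 64}, {24, 64}};
    for (const auto& c : configs) {
        uint64_t per_set = pages_per_set(c[0], c[1]);
        printf("count=%2llu offset=%3llu -> ~%2llu pages per set (%s)\n",
               (unsigned long long)c[0], (unsigned long long)c[1],
               (unsigned long long)per_set,
               per_set > TLB_WAYS ? "should alias" : "fits in the ways");
    }
    return 0;
}

Under this model only 16,128 and 24,128 exceed the 12 ways, which matches the expectation in the next comment but not the measured numbers above.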
I'm only surprised by the 128-offset results; 16,128 and 24,128 should cause increased misses.

The other thing I don't understand (and maybe it's not important) is: what is a TLB store miss? My understanding is that a normal cache access can be a read or a write, but a TLB access would always be a read, in order to do the address translation. A TLB read miss would then cause it to fetch the page table entry and populate the TLB; that would be the only way to write to the TLB, and calling that a "miss" doesn't make sense. So I'm probably missing something there.

I wouldn't expect amazing prefetcher performance on the 128-offset examples, since the 1536,1 case is still incurring some misses. The only other thing I can think of is that the perf stats for Skylake might be looking at the L1 TLB instead? But that should have had much more dramatically different results anyway, so that doesn't really make sense either. When I have more time, I will try to look through the Intel manuals to see if I can find anything that would explain this. That could be a while, though.
IMO a TLB store miss is a write access that didn't find its page in the TLB, and a TLB load miss is a read access that didn't find its page in the TLB. I looked briefly into the manual, and something that may affect it is hyper-threading, because the TLB will be partitioned differently with HT active. However, if I understand it correctly, it probably only affects the ITLB entries.

Btw, I found out that the …
This is in reference to #4. It makes direct use of perf events in Linux for measuring TLB misses, but also times each run. Overall, the results I got seem to make sense and are outlined in the README.md, but there are a few cases that don't (for example,

./tlb-aliasing 2048 1

doesn't give close to 2048 misses per iteration, which I would expect with a 1024-entry TLB). PTAL, thanks!
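Since the description above mentions using Linux perf events directly, here is a minimal, self-contained sketch of counting dTLB load misses with perf_event_open; this is the standard kernel interface, not necessarily the exact code used in the PR:

#include <linux/perf_event.h>
#include <asm/unistd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// perf_event_open has no glibc wrapper, so it is invoked via syscall().
static int perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    // dTLB read misses: cache id | (op << 8) | (result << 16).
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... the measured workload would run here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("dTLB load misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}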