Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PMU musings #808

Open
lukego opened this issue Mar 8, 2016 · 6 comments
Open

PMU musings #808

lukego opened this issue Mar 8, 2016 · 6 comments
Labels

Comments

@lukego
Copy link
Member

lukego commented Mar 8, 2016

The PMU is a powerful piece of hardware. We are only scratching the surface of how it can be used. This issue is meant to remind us of this fact and perhaps inspire the next step in our usage.

The idea I came up with is simple: to run the snabbmark basic1 benchmark many times, each time with a different set of PMU events, and each time with /packet counts reported.

Script:

-- Specify groups of PMU events by name pattern.
-- The CPU can track up to 8 events at a time.
-- See lib/pmu_cpu.lua for available counters.
local event_groups = {
   {"^l2_rqsts"},
   {"^l2_trans"},
   {"^dtlb_store"},
   {"^dtlb_load"},
   {"^offcore_requests"},
   {"^itlb_misses"},
   {"^cycle_activity"},
   {"^uops_executed_port"},
   {"^mem_trans_retired[.]load_latency_gt_"},
   {"^mem_uops_retired"},
   {"^mem_load_uops_retired"}
}

local snabbmark = require("program.snabbmark.snabbmark")
local pmu = require("lib.pmu")

local npackets = 10e6

-- Run basic1 benchmark separately for each group of events.
for _, events in ipairs(event_groups) do
   engine.configure(config.new())
   pmu.profile(
      function () snabbmark.basic1(npackets) end,
      events,
      {packet = npackets})
end

Output:


Processed 10.2 million packets in 0.35 seconds (rate: 29.2 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,235,445,846     123.545
ref_cycles                                            0       0.000
instructions                              2,733,223,244     273.322
l2_rqsts.all_code_rd                          1,073,783       0.107
l2_rqsts.all_demand_data_rd                  51,378,621       5.138
l2_rqsts.all_demand_miss                      1,534,342       0.153
l2_rqsts.all_demand_references               97,109,913       9.711
l2_rqsts.all_pf                              21,668,114       2.167
l2_rqsts.all_rfo                             44,657,508       4.466
l2_rqsts.code_rd_hit                            770,365       0.077
l2_rqsts.code_rd_miss                           291,776       0.029
packet                                       10,000,000       1.000
load: time: 0.34s  fps: 58,662,064 fpGbps: 32.381 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.0 million packets in 0.34 seconds (rate: 29.5 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,193,594,857     119.359
ref_cycles                                            0       0.000
instructions                              2,649,570,794     264.957
l2_trans.all_pf                               4,956,849       0.496
l2_trans.all_requests                       187,960,552      18.796
l2_trans.code_rd                                998,689       0.100
l2_trans.demand_data_rd                      50,124,949       5.012
l2_trans.l1d_wb                              65,830,399       6.583
l2_trans.l2_fill                              4,488,218       0.449
l2_trans.l2_wb                                1,015,145       0.102
l2_trans.rfo                                 43,511,452       4.351
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,915,032 fpGbps: 31.969 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.2 million packets in 0.35 seconds (rate: 29.0 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,815,425     122.782
ref_cycles                                            0       0.000
instructions                              2,718,874,369     271.887
dtlb_store_misses.miss_causes_a_walk              1,676       0.000
dtlb_store_misses.pde_cache_miss                  1,069       0.000
dtlb_store_misses.stlb_hit                      630,269       0.063
dtlb_store_misses.stlb_hit_2m                        33       0.000
dtlb_store_misses.stlb_hit_4k                   630,236       0.063
dtlb_store_misses.walk_completed                    974       0.000
dtlb_store_misses.walk_completed_2m_4m                9       0.000
dtlb_store_misses.walk_completed_4k                 965       0.000
packet                                       10,000,000       1.000
load: time: 0.34s  fps: 58,879,365 fpGbps: 32.501 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.1 million packets in 0.34 seconds (rate: 29.5 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,193,183,702     119.318
ref_cycles                                            0       0.000
instructions                              2,667,292,180     266.729
dtlb_load_misses.miss_causes_a_walk              35,047       0.004
dtlb_load_misses.pde_cache_miss                  35,446       0.004
dtlb_load_misses.stlb_hit                     4,021,610       0.402
dtlb_load_misses.stlb_hit_2m                          3       0.000
dtlb_load_misses.stlb_hit_4k                  4,021,607       0.402
dtlb_load_misses.walk_completed                   2,296       0.000
dtlb_load_misses.walk_completed_2m_4m                 5       0.000
dtlb_load_misses.walk_completed_4k                2,291       0.000
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,255,918 fpGbps: 31.605 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.1 million packets in 0.35 seconds (rate: 28.8 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,865,816     122.787
ref_cycles                                            0       0.000
instructions                              2,679,243,427     267.924
offcore_requests.all_data_rd                  1,571,509       0.157
offcore_requests.demand_code_rd                 245,140       0.025
offcore_requests.demand_data_rd                 777,747       0.078
offcore_requests.demand_rfo                     766,501       0.077
offcore_requests_buffer.sq_full                 168,542       0.017
offcore_requests_outstanding.all_data_rd  4,056,122,430     405.612
offcore_requests_outstanding.cycles_with_data_rd  4,056,122,541     405.612
offcore_requests_outstanding.cycles_with_demand_data_rd 11,611,673,598    1161.167
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,262,390 fpGbps: 31.609 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.1 million packets in 0.35 seconds (rate: 28.7 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,327,596     122.733
ref_cycles                                            0       0.000
instructions                              2,691,910,677     269.191
itlb_misses.miss_causes_a_walk                    1,586       0.000
itlb_misses.stlb_hit                                356       0.000
itlb_misses.stlb_hit_2m                               0       0.000
itlb_misses.stlb_hit_4k                             356       0.000
itlb_misses.walk_completed                          654       0.000
itlb_misses.walk_completed_2m_4m                      0       0.000
itlb_misses.walk_completed_4k                       654       0.000
itlb_misses.walk_duration                        45,872       0.005
packet                                       10,000,000       1.000
load: time: 0.34s  fps: 58,803,729 fpGbps: 32.460 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.0 million packets in 0.34 seconds (rate: 29.5 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,192,451,970     119.245
ref_cycles                                            0       0.000
instructions                              2,646,178,124     264.618
cycle_activity.cycles_l1d_pending                     0       0.000
cycle_activity.cycles_l2_pending             26,469,475       2.647
cycle_activity.cycles_ldm_pending         2,318,132,932     231.813
cycle_activity.cycles_no_execute            532,781,900      53.278
cycle_activity.stalls_l1d_pending           532,781,900      53.278
cycle_activity.stalls_l2_pending            559,251,375      55.925
cycle_activity.stalls_ldm_pending         2,850,914,832     285.091
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,957,221 fpGbps: 31.992 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.2 million packets in 0.35 seconds (rate: 29.1 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,491,185     122.749
ref_cycles                                            0       0.000
instructions                              2,727,096,840     272.710
uops_executed_port.port_0                   293,957,376      29.396
uops_executed_port.port_0_core              293,957,376      29.396
uops_executed_port.port_1                   369,128,840      36.913
uops_executed_port.port_1_core              369,128,840      36.913
uops_executed_port.port_2                   606,223,909      60.622
uops_executed_port.port_2_core              606,223,909      60.622
uops_executed_port.port_3                   630,122,095      63.012
uops_executed_port.port_3_core              630,122,095      63.012
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 58,311,443 fpGbps: 32.188 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.2 million packets in 0.35 seconds (rate: 29.2 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,493,960     122.749
ref_cycles                                            0       0.000
instructions                              2,709,107,907     270.911
mem_trans_retired.load_latency_gt_128                 0       0.000
mem_trans_retired.load_latency_gt_16                  0       0.000
mem_trans_retired.load_latency_gt_256                 0       0.000
mem_trans_retired.load_latency_gt_32                  0       0.000
mem_trans_retired.load_latency_gt_4                   0       0.000
mem_trans_retired.load_latency_gt_512                 0       0.000
mem_trans_retired.load_latency_gt_64                  0       0.000
mem_trans_retired.load_latency_gt_8                   0       0.000
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,967,957 fpGbps: 31.998 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.2 million packets in 0.35 seconds (rate: 29.1 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,939,122     122.794
ref_cycles                                            0       0.000
instructions                              2,716,211,116     271.621
mem_uops_retired.all_loads                  893,447,234      89.345
mem_uops_retired.all_stores                 577,123,036      57.712
mem_uops_retired.lock_loads                          44       0.000
mem_uops_retired.split_loads                        991       0.000
mem_uops_retired.split_stores                         0       0.000
mem_uops_retired.stlb_miss_loads                      0       0.000
mem_uops_retired.stlb_miss_stores                     0       0.000
packet                                       10,000,000       1.000
load: time: 0.35s  fps: 57,998,525 fpGbps: 32.015 fpb: 510 bpp: 60   sleep: 0   us

Processed 10.2 million packets in 0.35 seconds (rate: 29.1 Mpps).
EVENT                                             TOTAL     /packet
cycles                                    1,227,718,482     122.772
ref_cycles                                            0       0.000
instructions                              2,717,272,860     271.727
mem_load_uops_retired.hit_lfb                32,476,430       3.248
mem_load_uops_retired.l1_hit                810,899,999      81.090
mem_load_uops_retired.l1_miss                50,662,084       5.066
mem_load_uops_retired.l2_hit                 50,150,753       5.015
mem_load_uops_retired.l2_miss                         0       0.000
mem_load_uops_retired.l3_hit                          0       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                       10,000,000       1.000

No interpretation for now :-). However, this seems like an approach that could be used to gather data for more interesting benchmarks (like mp-ring in #804) and that could also be analyzed for interesting patterns and differences with an R script (like #755) to help explain results and validate intuitions.

@lukego lukego added the idea label Mar 8, 2016
@lukego
Copy link
Member Author

lukego commented Mar 8, 2016

cc @xrme @kbara in case this is handy as a reference one day.

@lukego
Copy link
Member Author

lukego commented Mar 8, 2016

I ran this with sudo taskset -c 0 ./snabb snsh ./snabbmark_pmu.lua and actually running this in the lab depends on snabblab/snabblab-nixos#23. I did a sneaky way to enable the msr module for PMU on lugano-1 like this:

$ nix-env -i module-init-tools
$ sudo insmod /nix/store/gxk7qq6s8hx2c20dch634xf2bh5hx6l9-linux-4.3.3/lib/modules/4.3.3/kernel/arch/x86/kernel/msr.ko

@mwiget
Copy link
Contributor

mwiget commented Mar 8, 2016

any reason you run it on cpu 0? That cpu gets the most work from the kernel by handling all interrupts.

@lukego
Copy link
Member Author

lukego commented Mar 8, 2016

Finger memory! Looking forward to merging and then following the new recommendations from you and @kbara :)

@lukego
Copy link
Member Author

lukego commented Mar 9, 2016

I braindumped some thoughts on lukego/blog#15. Basically I am wondering whether the PMU can work as a wireshark for catching expensive MESIF interactions (or at least a netstat -s). Seems to me like all performance issues around multicore are likely to be interactions with the L3 cache.

@lukego
Copy link
Member Author

lukego commented Mar 9, 2016

Reflection from lukego/blog#15: I suspect that L3 cache access is the key thing to optimize for any multiprocess application (including Snabb+VM over virtio-net). The L3 cache is "the network" that connects the cores together and has much higher latency than anything they do locally.

Here is a side-by-side comparison of some L3 PMU counters for the mp-ring benchmark (#804 #809) comparing single-process operation (left) with multi-process operation (right):

[luke@lugano-1:~/git/snabbswitch/src]$ pr -w 160 -m -t a b                                                                                                                         
Benchmark configuration:                                                        Benchmark configuration:
       burst: 100                                                                      burst: 100
  writebytes: 0                                                                   writebytes: 0
   processes: 1                                                                    processes: 2
   readbytes: 0                                                                    readbytes: 0
     packets: 10000000                                                               packets: 10000000
        mode: basic                                                                     mode: basic
   pmuevents: mem_load_uops_l3                                                     pmuevents: mem_load_uops_l3
  69.20 Mpps ring throughput per process                                           5.26 Mpps ring throughput per process
PMU report for child #0:                                                        PMU report for child #0:
EVENT                                             TOTAL     /packet             EVENT                                             TOTAL     /packet
cycles                                      500,751,679      50.075             cycles                                    6,651,599,204     665.160
ref_cycles                                            0       0.000             ref_cycles                                            0       0.000
instructions                                562,015,029      56.202             instructions                                593,294,818      59.329
mem_load_uops_l3_hit_retired.xsnp_hit               447       0.000             mem_load_uops_l3_hit_retired.xsnp_hit           432,740       0.043
mem_load_uops_l3_hit_retired.xsnp_hitm                4       0.000             mem_load_uops_l3_hit_retired.xsnp_hitm        9,410,279       0.941
mem_load_uops_l3_hit_retired.xsnp_miss              280       0.000             mem_load_uops_l3_hit_retired.xsnp_miss            9,189       0.001
mem_load_uops_l3_hit_retired.xsnp_none              567       0.000             mem_load_uops_l3_hit_retired.xsnp_none           11,856       0.001
mem_load_uops_l3_miss_retired.local_dram              0       0.000             mem_load_uops_l3_miss_retired.local_dram              0       0.000
mem_load_uops_l3_miss_retired.remote_dram             0       0.000             mem_load_uops_l3_miss_retired.remote_dram             0       0.000
mem_load_uops_l3_miss_retired.remote_fwd              0       0.000             mem_load_uops_l3_miss_retired.remote_fwd              0       0.000
mem_load_uops_l3_miss_retired.remote_hitm             0       0.000             mem_load_uops_l3_miss_retired.remote_hitm             0       0.000
packet                                       10,000,000       1.000             packet                                       10,000,000       1.000

(also as a gist in case that is easier to read.)

I see two interesting and likely related things here:

  1. The single-process version is executing about 10x faster in terms of instructions-per-cycle.
  2. The single-process version has no L3 access while the multi-process version has one xsnp_hitm event per packet. I believe this event means that the L3 cache served a hit from a line that was _m_odified in another core.

From the paper cited in #735 I would expect an L3 hitm to have >100 cycles of latency and that could partly explain the whopping 665 cycles/packet performance that the benchmark is reporting. So I predict that if somebody is able to identify that hitm and reduce its frequency then that will significantly increase the performance of this benchmark.

Now if we had working PEBS support (#631) hooked into a profiler we could also have a report telling us which instruction is triggering that hitm and which data structure it is accessing at the time. This is firmly on my wishlist for the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants