Aborts in trace due to loop unroll limit and leaving loop in root trace #1239
base: wingo-next
Conversation
Clear the penalty slot associated with a bytecode after it is traced successfully. This prevents penalties from accumulating for bytecodes that are traced frequently, such as branchy subroutines that are called from many different root traces and will need to record a side-trace for each. Especially intended to handle the special case when applications are generating code at runtime that creates a "fairly large" number of root traces and that need correspondingly many side-traces to be recorded without prematurely and unproductively blacklisting things.
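To illustrate the idea (a sketch only, not the actual patch; it borrows the HotPenalty cache layout from upstream LuaJIT's lj_jit.h, and the helper name `penalty_clear` is an assumption): once a trace has been recorded successfully for a bytecode, the matching penalty slot is reset so repeated re-tracing never drifts toward the blacklist threshold.

```c
/* Sketch only: reset any penalty entry for a successfully traced bytecode.
 * Uses LuaJIT's round-robin penalty cache (J->penalty[], PENALTY_SLOTS,
 * PENALTY_MIN from lj_jit.h); penalty_clear is a name assumed here. */
static void penalty_clear(jit_State *J, const BCIns *startpc)
{
  uint32_t i;
  for (i = 0; i < PENALTY_SLOTS; i++) {
    if (mref(J->penalty[i].pc, const BCIns) == startpc) {
      J->penalty[i].val = PENALTY_MIN;  /* forget the accumulated penalty */
      break;
    }
  }
}
```

The natural call site would be wherever the recorder commits a finished trace, so a bytecode that aborts occasionally but usually traces fine is never pushed toward blacklisting.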
Increase HOTCOUNT_MAX so that the JIT will make more attempts to trace a bytecode before it blacklists. Expand the HotPenalty.val from 16-bit to 32-bit to accommodate the larger value. HOTCOUNT_MAX is increased from 60,000 to 6,000,000. This is a 100x increase but the effect should be much smaller, log2(100) times, because the penalty value is increased exponentially. I don't entirely understand the existing design of the hotcount penalty: - Why initialize HOTCOUNT_MIN at 36*2? - Why increase the penalty exponentially instead of incrementally? - Why add random entropy to the increases? So I only hope that this patch doesn't break any important properties.
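For context, here is a self-contained toy model (not RaptorJIT source) of the penalty growth: the value starts at 36*2, roughly doubles on every abort plus a few random bits, and the bytecode is blacklisted once the value exceeds the ceiling. Running it shows why a 100x larger ceiling only buys about log2(100) ≈ 7 extra recording attempts.

```c
/* Toy model of the exponential penalty bump (values from the commit text:
 * ceiling raised from 60,000 to 6,000,000; start value 36*2; 4 random bits
 * of entropy per bump).  Not the actual RaptorJIT code. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  unsigned ceilings[] = { 60000u, 6000000u };  /* old and new HOTCOUNT_MAX */
  for (int c = 0; c < 2; c++) {
    unsigned val = 36 * 2;                     /* initial penalty */
    int aborts = 0;
    while (val <= ceilings[c]) {
      val = (val << 1) + (rand() & 15);        /* double + random entropy */
      aborts++;
    }
    printf("ceiling %7u -> ~%d aborts tolerated before blacklisting\n",
           ceilings[c], aborts);
  }
  return 0;
}
```

With these numbers the JIT should retry roughly 17 times instead of roughly 10 before giving up on a bytecode.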
I have worked on this some more and have some additional observations. I am getting around 6 Mpps when running this.
After adding one more app at commit b7b14cc the performance dropped drastically, to around 600 kpps, with most of the time spent in interpreted code. I am wondering what could have caused all the traces to be aborted and the code to fall back to the interpreter. A dump file is attached.
Just wondering which aborted trace could be behind the heavy use of interpreted code, and how that can be identified? It may very well be that many trace aborts are normal and have no bearing on performance, but some could. Feeling kind of lost here :(
Thank you for using a Pull Request to raise this issue. It is very helpful to have the exact code you are working with connected to the bug report. That way the source code line numbers in the trace can be looked up in the right code version. 👍

The JIT log does not explicitly say what is being blacklisted, but it does include a lot of aborts for the same region of code, and that seems likely to be the problem (or in any event something worth understanding):

130930:---- TRACE 112 start 95/18 intel_mp.lua:595
134405:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
134408:---- TRACE 112 start 95/18 intel_mp.lua:595
137883:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
137886:---- TRACE 112 start 95/18 intel_mp.lua:595
141361:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
141364:---- TRACE 112 start 95/18 intel_mp.lua:595
144839:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached

My working hypothesis is that this code is being blacklisted and that this is causing some important code path to run interpreted. It could be that raptorjit/raptorjit#102 is the solution, i.e. make the JIT give up on the unrolling optimization after a while instead of giving up on the whole trace.

We have seen other instances of blacklistings for code starting in intel_mp lately, right? Does anybody already have some insight into these? (I have "swapped out" the traces that I was looking at before vacation, but maybe somebody can refresh my memory? cc @alexandergall @wingo @eugeneia)
I am seeing quite a lot of trace aborts related to loops, not just in `intel_mp` but in other modules as well. The code in the PR is just a small sample we can use for diagnosis.
Create an execution barrier that a JIT trace cannot cross. During recording, the current trace must end when it reaches a barrier, and a new root trace starts immediately afterwards. This is implemented as a simple "nop" C library function. The existing "trace stitching" mechanism provides the required semantics, i.e. stop the current trace, call the nop function, and start a new trace with linkage.
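A sketch of what such a barrier might look like (hypothetical names, not the actual patch): a do-nothing function registered through the classic Lua/C API. Calls made through that API are not compiled into traces, so every call to it ends the current trace and trace stitching links a fresh trace right after the call returns.

```c
/* Sketch (hypothetical names): an empty lua_CFunction used as a JIT trace
 * barrier.  Calls through the Lua/C API are not compiled into traces, so
 * the recorder must stop here and stitch a new trace after the call. */
#include "lua.h"

static int barrier_nop(lua_State *L)
{
  (void)L;        /* deliberately does nothing */
  return 0;       /* no results */
}

int luaopen_tracebarrier(lua_State *L)
{
  lua_pushcfunction(L, barrier_nop);
  return 1;       /* the module value is the barrier function itself */
}
```

Lua code that wants to force a trace split then just calls this function at the point where one root trace should end and the next should begin.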
Merged lukego/jit-tracebarrier and ran the tests again.
I am using this problem as a test case for improved profiler functions. I have updated the profiler backend in raptorjit/raptorjit#124 with information that should point to a solution, and now I need to update the Studio frontend so that we can see what it means :).
raptorjit: amend prev. commit, also clear trace exit hotcounts
Creating a PR for the loop aborts I am seeing so that @lukego can take a look. This PR is based on `wingo-next` and has `lukego/luajit-reset-penalty-cache` merged in. The test app is at `program/jit_loop/` and a helper app is at `apps/jit_loop/`. The app is basically a bridge that copies packets from one interface to another, rewriting the MAC address before sending them to the next hop. Before running the app, edit the PCI info and MAC address in `jit_loop.lua`. I am running it as:
sudo ./snabb snsh -jdump=+rsxaA,dump.txt -jtprof -p jit_loop
Getting lots of trace aborts as shown below: