Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aborts in trace due to loop unroll limit and leaving loop in root trace #1239

Open
wants to merge 13 commits into
base: wingo-next
Choose a base branch
from

Conversation

raj2569
Copy link

@raj2569 raj2569 commented Oct 3, 2017

Creating a PR for loop aborts I am seeing so that @lukego can take a look.

This PR is based on wingo-next and has merged in lukego/luajit-reset-penalty-cache.

The test app is at program/jit_loop/ and a helper app is at apps/jit_loop/. The app is basically a bridge which copies packets from one interface to another after changing mac and sends to next hop. Before running the app edit the PCI info, and MAC address in jit_loop.lua.

I am running this as: sudo ./snabb snsh -jdump=+rsxaA,dump.txt -jtprof -p jit_loop

Getting lots of trace abort as shown below:

---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 30 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 21 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 16 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort counter.lua:86 -- failed to allocate mcode memory
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 21 abort app.lua:511 -- leaving loop in root trace
---- TRACE 38 abort intel_mp.lua:625 -- leaving loop in root trace
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 49 abort intel_mp.lua:600 -- leaving loop in root trace
---- TRACE 70 abort intel_mp.lua:619 -- inner loop in root trace
---- TRACE 70 abort app.lua:511 -- leaving loop in root trace
---- TRACE 87 abort intel_mp.lua:619 -- inner loop in root trace

lukego and others added 8 commits September 1, 2017 09:23
Clear the penalty slot associated with a bytecode after it is traced
successfully.

This prevents penalties from accumulating for bytecodes that are
traced frequently, such as branchy subroutines that are called from
many different root traces and will need to record a side-trace for
each.

Especially intended to handle the special case when applications are
generating code at runtime that creates a "fairly large" number of
root traces and that need correspondingly many side-traces to be
recorded without prematurely and unproductively blacklisting things.
Increase HOTCOUNT_MAX so that the JIT will make more attempts to trace
a bytecode before it blacklists. Expand the HotPenalty.val from 16-bit
to 32-bit to accommodate the larger value.

HOTCOUNT_MAX is increased from 60,000 to 6,000,000. This is a 100x
increase but the effect should be much smaller, log2(100) times,
because the penalty value is increased exponentially.

I don't entirely understand the existing design of the hotcount penalty:
- Why initialize HOTCOUNT_MIN at 36*2?
- Why increase the penalty exponentially instead of incrementally?
- Why add random entropy to the increases?

So I only hope that this patch doesn't break any important properties.
@raj2569
Copy link
Author

raj2569 commented Oct 4, 2017

I have worked on this some more and have some additional observations. To prevent the failed to allocate mcode memory I have increased the luajit tuneables in commit a268212.

I am getting around 6mpps when running this and can see from perf top that most of the time CPU is running on compiled trace. When running under snsh using the command sudo ./snabb snsh -jdump=+rsxaA,dump01.txt -jtprof -p jit_loop the performance drops to 2mpps.

dump01.txt is attached.
dump01.txt.gz

The perf top shows the following when running under snsh:

  12.65%  perf-25499.map      [.] 0x000000002a4d3a60                                                                                             
  10.31%  perf-25499.map      [.] 0x000000002a4d5e06                                                                                             
   7.18%  perf-25499.map      [.] 0x000000002a4d5e0a                                                                                             
   3.85%  perf-25499.map      [.] 0x000000002a4d3a64                                                                                             
   1.88%  perf-25499.map      [.] 0x000000002a4e237c                                                                                             
   1.82%  snabb               [.] lj_tab_len                                                                                                     
   1.82%  perf-25499.map      [.] 0x000000002a4d3a6a                                                                                             
   1.76%  perf-25499.map      [.] 0x000000002a4c9c7f                                                                                             
   1.67%  perf-25499.map      [.] 0x000000002a4cc582                                                                                             
   1.66%  perf-25499.map      [.] 0x000000002a4d5e10                                                                                             
   1.41%  perf-25499.map      [.] 0x000000002a4d39da                                                                                             
   1.21%  perf-25499.map      [.] 0x000000002a4d1e36                                                                                             
   1.11%  perf-25499.map      [.] 0x000000002a4e2d1d                                                                                             
   1.03%  perf-25499.map      [.] 0x000000002a4d5300                                                                                             
   0.75%  [vdso]              [.] __vdso_clock_gettime                                                                                           
   0.69%  perf-25499.map      [.] 0x000000002a4d5e31                                                                                             
   0.66%  perf-25499.map      [.] 0x000000002a4d5d80                                                                                             
   0.57%  perf-25499.map      [.] 0x000000002a4d3a8b                                                                                             
   0.55%  perf-25499.map      [.] 0x000000002a4d20d6                                                                                             
   0.51%  [kernel]            [k] native_write_msr_safe                                                                                          
   0.50%  perf-25499.map      [.] 0x000000002a4d224d                                                                                             
   0.48%  perf-25499.map      [.] 0x000000002a4c65b5                                                                                             
   0.47%  perf-25499.map      [.] 0x000000002a4d433c                                                                                             
   0.47%  snabb               [.] release_unused_segments                                                                                        
   0.46%  snabb               [.] lj_tab_getinth                                                                                                 
   0.46%  perf-25499.map      [.] 0x000000002a4d1e6e                                                                                             
   0.45%  perf-25499.map      [.] 0x000000002a4d1e73                                                                                             
   0.40%  snabb               [.] lj_alloc_malloc                                                                                                
   0.40%  perf-25499.map      [.] 0x000000002a4f09a9                                                                                             
   0.39%  snabb               [.] lj_alloc_free                                                                                                  
   0.39%  [kernel]            [k] __switch_to                                                                                                    
   0.38%  [kernel]            [k] apic_timer_interrupt                                                                                           
   0.37%  perf-25499.map      [.] 0x000000002a4f112c                                                                                             
   0.35%  perf-25499.map      [.] 0x000000002a4d2029                                                                                             
   0.34%  [kernel]            [k] __schedule                                                                                                     
   0.34%  perf-25499.map      [.] 0x000000002a4f0dd2                                                                                             
   0.34%  perf-25499.map      [.] 0x000000002a4e3913                                                                                             
   0.34%  perf-25499.map      [.] 0x000000002a4d20b6                                                                                             
   0.34%  perf-25499.map      [.] 0x000000002a4ea2a7                                                                                             
   0.33%  perf-25499.map      [.] 0x000000002a4f0f29                                                                                             
   0.32%  perf-25499.map      [.] 0x000000002a4f0ba6                                                                                             
   0.30%  snabb               [.] get_monotonic_time                                                                                             
   0.28%  perf-25499.map      [.] 0x000000002a4d1e6a                                                                                             
   0.28%  [kernel]            [k] enqueue_task_fair                                                                                              
   0.27%  perf-25499.map      [.] 0x000000002a4bf504                                                                                             
   0.27%  libc-2.19.so        [.] __clock_gettime                                                                                                
   0.24%  [kernel]            [k] menu_select                                                                                                    

@raj2569
Copy link
Author

raj2569 commented Oct 4, 2017

After adding one more app at commit b7b14cc the performance dropped drastically to around 600kpps. Most of the time snabb spent on interpreted code. I am wondering what could have caused all the traces to be aborted and run in interpreted mode.

Dump file is attached,
dump02.txt.gz

The perf top shows:

   7.77%  libc-2.19.so        [.] vfprintf                                                                                                       
   5.57%  snabb               [.] lj_str_new                                                                                                     
   5.46%  snabb               [.] match                                                                                                          
   3.28%  snabb               [.] lj_strscan_scan                                                                                                
   2.65%  snabb               [.] release_unused_segments                                                                                        
   2.61%  snabb               [.] lj_alloc_free                                                                                                  
   2.40%  perf-25587.map      [.] 0x000000002a4d55db                                                                                             
   2.21%  libc-2.19.so        [.] _IO_default_xsputn                                                                                             
   2.06%  perf-25587.map      [.] 0x000000002a4d4ea5                                                                                             
   2.04%  perf-25587.map      [.] 0x000000002a4d4ea9                                                                                             
   1.36%  snabb               [.] match_class                                                                                                    
   1.34%  snabb               [.] max_expand                                                                                                     
   1.30%  snabb               [.] gc_sweep                                                                                                       
   1.23%  perf-25587.map      [.] 0x000000002a4d55df                                                                                             
   1.15%  snabb               [.] str_find_aux                                                                                                   
   1.15%  libc-2.19.so        [.] _IO_vsprintf                                                                                                   
   0.97%  libc-2.19.so        [.] strchrnul                                                                                                      
   0.92%  snabb               [.] lj_alloc_malloc                                                                                                
   0.73%  perf-25587.map      [.] 0x000000002a4c4f29                                                                                             
   0.70%  snabb               [.] lj_cdata_free                                                                                                  
   0.69%  perf-25587.map      [.] 0x000000002a4c5858                                                                                             
   0.66%  perf-25587.map      [.] 0x000000002a4d5555                                                                                             
   0.64%  perf-25587.map      [.] 0x000000002a4e5d65                                                                                             
   0.59%  snabb               [.] lj_tab_len                                                                                                     
   0.54%  snabb               [.] lj_lib_checkstr                                                                                                
   0.53%  snabb               [.] lj_cdata_newv                                                                                                  
   0.50%  snabb               [.] lua_pushlstring                                                                                                
   0.47%  snabb               [.] singlematch                                                                                                    
   0.45%  libc-2.19.so        [.] 0x00000000000460a0                                                                                             
   0.42%  snabb               [.] lj_cont_stitch                                                                                                 
   0.40%  perf-25587.map      [.] 0x000000002a4e67fc                                                                                             
   0.40%  perf-25587.map      [.] 0x000000002a4d4e1f                                                                                             
   0.38%  snabb               [.] classend.isra.9                                                                                                
   0.37%  perf-25587.map      [.] 0x000000002a4d55e5                                                                                             
   0.37%  perf-25587.map      [.] 0x000000002a4cd9e8                                                                                             
   0.36%  snabb               [.] lj_strscan_num                                                                                                 
   0.33%  snabb               [.] lj_mem_newgco                                                                                                  
   0.32%  snabb               [.] push_onecapture                                                                                                
   0.29%  [kernel]            [k] native_write_msr_safe                                                                                          
   0.29%  [vdso]              [.] __vdso_clock_gettime                                                                                           
   0.29%  snabb               [.] lj_BC_FUNCC                                                                                                    
   0.29%  perf-25587.map      [.] 0x000000002a4ed2b4                                                                                             
   0.29%  snabb               [.] lj_BC_JLOOP                                                                                                    
   0.28%  perf-25587.map      [.] 0x000000002a4d444e                                                                                             
   0.28%  [kernel]            [k] __switch_to                                                                                                    
   0.28%  perf-25587.map      [.] 0x000000002a4d4eaf                                                                                             
   0.27%  perf-25587.map      [.] 0x000000002a4ecf87                                                                                             

@raj2569
Copy link
Author

raj2569 commented Oct 5, 2017

Just wondering which aborted trace could be behind the high use of interpreted code? How can that be identified? It may be very well be possible for many trace aborts to be normal and not have a bearing on performance, but some could be. feeling kind of lost here :(

@lukego
Copy link
Member

lukego commented Nov 2, 2017

Thank you for using a Pull Request to raise this issue. It is very helpful to have the exact code you are working with connected with the bug report. That way the source code line numbers in the trace can be looked up in the right code version. 👍

The JIT log does not explicitly say what is being blacklisted, but it does include a lot of aborts for the same region of code, and that seems likely to be the problem (or in any event something worth understanding.)

 130930:---- TRACE 112 start 95/18 intel_mp.lua:595
 134405:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
 134408:---- TRACE 112 start 95/18 intel_mp.lua:595
 137883:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
 137886:---- TRACE 112 start 95/18 intel_mp.lua:595
 141361:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
 141364:---- TRACE 112 start 95/18 intel_mp.lua:595
 144839:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached

My working hypothesis is that this code is being blacklisted and this is causing some important code path to run interpreted. Could be that raptorjit/raptorjit#102 is the solution i.e. make the JIT give up on the unrolling optimization after a while instead of giving up on the whole trace.

We have seen other instances of blacklistings for code starting in intel_mp lately right? Does anybody already have some insight into these? (I have "swapped out" the traces that I was looking at before vacation but maybe somebody can refresh my memory? cc @alexandergall @wingo @eugeneia)

@raj2569
Copy link
Author

raj2569 commented Nov 2, 2017 via email

Create an execution barrier that a JIT trace cannot cross. During
recording, the current trace must end when it reaches a barrier and
then a new root trace starts immediately afterwards.

This is implemented as a simple "nop" C library function. The existing
"trace stitching" mechanism provides the required semantics i.e. stop
current trace, call nop function, start new trace with linkage.
@lukego
Copy link
Member

lukego commented Nov 6, 2017

@raj2569 Could you please try merging #1242 and post a dump? I hope this will make the JIT trace easier to read by breaking the traces up into smaller parts (always start a new trace when entering/exiting an app callback to avoid mixing too much code together.)

@raj2569
Copy link
Author

raj2569 commented Nov 6, 2017

Merged lukego/jit-tracebarrier and ran the tests again.
Dumpfile is attached. dump-with-1242.txt.gz

@lukego
Copy link
Member

lukego commented Nov 17, 2017

I am using this problem as a test case for improved profiler functions. Have updated the profiler backend in raptorjit/raptorjit#124 with information that should point to a solution, and now I need to update the Studio frontend so that we can see what it means :).

eugeneia pushed a commit to eugeneia/snabb that referenced this pull request Dec 20, 2019
raptorjit: amend prev. commit, also clear trace exit hotcounts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants