
NUMA Awareness in Performance Test


This page is a place to work on the documentation of NUMA awareness in HPX. This topic was brought up on our IRC channel, and a transcript is posted below. We would like to eventually turn this transcript into a more concise explanation of the subject:

< mariomulansky> i have a question about the paper: http://stellar.cct.lsu.edu/pubs/isc2012.pdf
< mariomulansky> I'm trying to reproduce Fig. 3
< heller> ok
< mariomulansky> but my single socket performance is lower and the 8-socket performance is higher
< heller> sure, that's expected ;)
< mariomulansky> than the one reported in the graph
< heller> err
< heller> you lost me ...
< heller> are you running the jacobi examples?
< mariomulansky> Fig. 3 in the paper reports memory bandwidth from STREAM
< heller> ahh, yes of course
< mariomulansky> so i'm running STREAM on lyra, but i get different numbers - namely a gain of factor ~9 when going from 1 to 48 threads
< heller> ok
< mariomulansky> i like your numbers more, what should i do to get those ;)
< heller> did you set interleaved numa memory placement for the 48 threads run?
< mariomulansky> how? and would that make it faster?
< heller> no
< heller> that would make the 48 thread run slower :P
< heller> the stream benchmark has perfect NUMA placement
< heller> by default
< heller> that means, there is no inter socket communication
< heller> which makes it faster
< mariomulansky> ah right
< heller> that shouldn't matter for the 1 thread run
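
Editor's note: heller's remark that STREAM has "perfect NUMA placement by default" refers to Linux's first-touch policy: a page is placed on the NUMA domain of the core that first writes to it, and the OpenMP STREAM initializes its arrays in a parallel loop with the same schedule as the measurement loops. A minimal sketch of that effect (not the actual STREAM source; array names and size are illustrative):

```c
/* Sketch of first-touch NUMA placement, as in an OpenMP STREAM run.
 * Each thread initializes the chunk of the arrays it will later stream
 * over, so with pinned threads those pages end up on the thread's local
 * NUMA domain and no inter-socket traffic is needed. */
#include <stdlib.h>

#define N 20000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *c = malloc(N * sizeof *c);

    /* first touch: this loop both initializes and places the pages */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        a[i] = 1.0;
        c[i] = 0.0;
    }

    /* the measurement loop uses the same static schedule, so every
     * access hits memory attached to the thread's own socket */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        c[i] = a[i];

    free(a);
    free(c);
    return 0;
}
```

This is also why interleaving, discussed below, makes the 48-thread run slower: it deliberately gives up that locality.
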
< mariomulansky> what is the bandwidth between nodes there?
< heller> the bandwidth between the nodes is negligible
< heller> it's a few GB/s
< heller> the bottleneck is the bandwidth to main memory
< heller> you can measure the NUMA traffic with likwid-perfctr
< mariomulansky> ah well that's what i mean - the bandwidth to memory from other NUMA domains
< heller> you should also set the array size to 20000000 (stream.c line 57)
< heller> that is determined by the bandwidth of one memory controller
< heller> which is the maximum achievable bandwidth for one socket
< mariomulansky> ok, so did you set interleaved memory placement?
< mariomulansky> i'm running with 10000000
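
Editor's note: the array-size change is a one-line edit in stream.c; the exact line number and macro name depend on the STREAM version (older releases call the size N, newer ones STREAM_ARRAY_SIZE), so treat the following as a sketch:

```c
/* stream.c: size of each of the three test arrays (a, b, c).
 * It has to be much larger than the combined last-level caches so the
 * benchmark measures memory bandwidth rather than cache bandwidth.
 * 20000000 doubles are ~160 MB per array, ~480 MB in total. */
#define N 20000000
```
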
< heller> the bandwidth reported for the copy test needs to be multiplied by 1.5
< mariomulansky> why? is that the one you used?
< heller> yeah
< heller> well, because you have two loads and one store
< mariomulansky> i see
< heller> I don't remember the exact reason ...
< heller> why it needs to be multiplied by 1.5
< heller> something to do with caches ...
< mariomulansky> boah...
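
Editor's note: the factor of 1.5 is not explained in the transcript; the usual explanation (our assumption here) is write-allocate caches. STREAM counts two words of traffic per Copy iteration, the load of a[i] and the store to c[i], but before the store can complete the cache first has to read the target line of c[i] from memory, so the hardware actually moves three words per iteration. The memory-controller traffic measured by likwid-perfctr is therefore 3/2 = 1.5 times the bandwidth STREAM reports for Copy.
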
< mariomulansky> if i don't pin the threads on 48-thread runs, the performance gets much worse
< heller> compare: numactl --interleave=0-7 likwid-pin -c 0-47 ./stream_omp
< heller> with: likwid-pin -c 0-47 ./stream_omp
< mariomulansky> numactl: command not found
< heller> mariomulansky: /home/heller/bin/numactl
< heller> use that one
< mariomulansky> permission denied
< heller> one sec
< heller> try again
< heller> and compare:
< heller> /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< heller> numactl --interleave=0,1 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< heller> numactl --membind=0 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< mariomulansky> $ ls /home/heller/bin/numactl
< mariomulansky> ls: cannot access /home/heller/bin/numactl: Permission denied
< heller> look at "| Memory bandwidth [MBytes/s] | 9075.92 | 8847.24 |"
< heller> for the different cores
< heller> grmp
< heller> one sec
< heller> mariomulansky: please try again
< heller> mariomulansky: those three commands execute the same benchmark, the only difference is how the memory is placed on the different NUMA domains
< heller> topology matters :P
< mariomulansky> i see
< heller> mariomulansky: also, the perfctr likwid tool will report the correct bandwidth ;)
<@wash> one sec, will just install numactl
< heller> wash: thanks
< heller> mariomulansky: also, try the above commands with the NUMA or NUMA2 performance group (after the -g switch)
< heller> and observe
< mariomulansky> ok i see
< mariomulansky> thanks a lot, this is way more complicated than i would like it to be
< mariomulansky> i have to go now
< mariomulansky> maybe you can tell me what settings you used for Fig. 3 :)
< mariomulansky> thanks wash !
< heller> mariomulansky: didn't i already?
< heller> what is it you are missing?
< heller> i used interleaved memory binding
< heller> which places the memory in a round robin fashion
< mariomulansky> ah ok, so those are your settings
< mariomulansky> ok
< heller> where the arguments to interleave are only the numbers of the NUMA domains involved
< heller> so, for a twelve thread run (only two NUMA domains), you only do --interleave=0,1
< heller> if you do --interleave=0-7
< heller> you'll see an increase of performance
< heller> because more memory controllers are used
< mariomulansky> i see - makes sense
< heller> makes perfect sense once you've gone through the pain of looking at those performance counters with the stream benchmark ;)
< heller> but it's a good exercise
< heller> everyone in this channel should have done this at least once ;)
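
Editor's note: the interleaving heller describes above is what numactl --interleave=0,1 sets up for the whole process; the same round-robin placement can be requested per allocation through the libnuma user-level API. A minimal sketch, assuming libnuma is installed (link with -lnuma); buffer size and node string are illustrative:

```c
/* Interleave one buffer's pages round robin over NUMA domains 0 and 1,
 * roughly what numactl --interleave=0,1 does for every allocation of
 * the process.  More domains in the node string means more memory
 * controllers serving the traffic. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    size_t size = 1UL << 30;                        /* 1 GiB buffer */
    struct bitmask *nodes = numa_parse_nodestring("0,1");

    double *buf = numa_alloc_interleaved_subset(size, nodes);
    if (buf == NULL)
        return 1;

    /* ... run the bandwidth-bound kernel on buf ... */

    numa_free(buf, size);
    numa_bitmask_free(nodes);
    return 0;
}
```
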
< mariomulansky> so even if i run on one thread i can use memory from other sockets
< heller> sadly, no one except me did that :(
< heller> yes
< mariomulansky> with numactl
< heller> yup
< mariomulansky> ok
< heller> note, numactl is just a tool that uses the libnuma user level API
< mariomulansky> i will look into that further later
< mariomulansky> now i have to go
< mariomulansky> thanks a lot!
< mariomulansky> bye
< heller> you could even manually place your memory with this (documented) API
< heller> enjoy!
< mariomulansky> thanks :)
< mariomulansky> ciao
< heller> aserio: btw, you could pick that conversation up and put it into some kind of documentation form ;)
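
Editor's note: "manually place your memory with this (documented) API" refers to libnuma as well. Below is a minimal sketch (our illustration, not HPX code) of explicit per-domain placement, the alternative to interleaving: each NUMA domain gets its own block, and a NUMA-aware application keeps each worker on the block local to its socket so no traffic crosses the socket interconnect. Link with -lnuma; the block size is illustrative.

```c
/* Place one block of memory on each NUMA domain explicitly, instead of
 * interleaving a single allocation over all of them. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int domains = numa_max_node() + 1;       /* e.g. 8 on lyra */
    size_t per_domain = 1UL << 28;           /* 256 MiB per domain */
    double **block = malloc(domains * sizeof *block);

    for (int node = 0; node < domains; ++node)
        block[node] = numa_alloc_onnode(per_domain, node);

    /* ... each worker thread operates on the block attached to its own
     * socket, so its memory accesses stay within the local domain ... */

    for (int node = 0; node < domains; ++node)
        numa_free(block[node], per_domain);

    free(block);
    return 0;
}
```
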
