
NUMA Awareness in Performance Test


This page is a place to work on the documentation of NUMA awareness in HPX. This topic was brought up on our IRC channel, and a transcript is posted below. We would like to eventually turn this transcript into a more concise explanation of the subject:

< mariomulansky> i have a question about the paper: http://stellar.cct.lsu.edu/pubs/isc2012.pdf
< mariomulansky> I'm trying to reproduce Fig. 3
< heller> ok
< mariomulansky> but my single socket performance is lower and the 8-socket performance is higher
< heller> sure, that's expected ;)
< mariomulansky> than the one reported in the graph
< heller> err
< heller> you lost me ...
< heller> are you running the jacobi examples?
< mariomulansky> Fig. 3 in the paper reports memory bandwidth from STREAM
< heller> ahh, yes of course
< mariomulansky> so i'm running STREAM on lyra, but i get different numbers - namely a gain of factor ~9 when going from 1 to 48 threads
< heller> ok
< mariomulansky> i like your numbers more, what should i do to get those ;)
< heller> did you set interleaved numa memory placement for the 48 threads run?
< mariomulansky> how? and would that make it faster?
< heller> no
< heller> that would make the 48 thread run slower :P
< heller> the stream benchmark has perfect NUMA placement
< heller> by default
< heller> that means, there is no inter socket communication
< heller> which makes it faster
< mariomulansky> ah right
< heller> that shouldn't matter for the 1 thread run
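
Editor's note: heller's remark that STREAM has "perfect NUMA placement by default" refers to Linux's first-touch policy: a page is placed on the NUMA domain of the core that first writes to it, and the OpenMP STREAM initializes its arrays in a parallel loop with the same schedule as the measurement loops. A minimal sketch of that effect (not the actual STREAM source; array names and size are illustrative):

```c
/* Sketch of first-touch NUMA placement, as in an OpenMP STREAM run.
 * Each thread initializes the chunk of the arrays it will later stream
 * over, so with pinned threads those pages end up on the thread's local
 * NUMA domain and no inter-socket traffic is needed. */
#include <stdlib.h>

#define N 20000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *c = malloc(N * sizeof *c);

    /* first touch: this loop both initializes and places the pages */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        a[i] = 1.0;
        c[i] = 0.0;
    }

    /* the measurement loop uses the same static schedule, so every
     * access hits memory attached to the thread's own socket */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        c[i] = a[i];

    free(a);
    free(c);
    return 0;
}
```

This is also why interleaving, discussed below, makes the 48-thread run slower: it deliberately gives up that locality.
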
< mariomulansky> what is the bandwidth between nodes there?
< heller> the bandwidth between the nodes is negligible
< heller> it's a few GB/s
< heller> the bottleneck is the bandwidth to main memory
< heller> you can measure the NUMA traffic with likwid-perfctr
< mariomulansky> ah well that's what i mean - the bandwidth to memory from other NUMA domains
< heller> you should also set the array size to 20000000 (stream.c line 57)
< heller> that is determined by the bandwidth of one memory controller
< heller> which is the maximum achievable bandwidth for one socket
< mariomulansky> ok, so did you set interleaved memory placement?
< mariomulansky> i'm running with 10000000
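
Editor's note: the array-size change is a one-line edit in stream.c; the exact line number and macro name depend on the STREAM version (older releases call the size N, newer ones STREAM_ARRAY_SIZE), so treat the following as a sketch:

```c
/* stream.c: size of each of the three test arrays (a, b, c).
 * It has to be much larger than the combined last-level caches so the
 * benchmark measures memory bandwidth rather than cache bandwidth.
 * 20000000 doubles are ~160 MB per array, ~480 MB in total. */
#define N 20000000
```
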
< heller> the bandwidth reported for the copy test needs to be multiplied by 1.5
< mariomulansky> why? is that the one you used?
< heller> yeah
< heller> well, because you have two loads and one store
< mariomulansky> i see
< heller> I don't remember the exact reason ...
< heller> why it needs to be multiplied by 1.5
< heller> something to do with caches ...
< mariomulansky> boah...
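
Editor's note: the factor of 1.5 is not explained in the transcript; the usual explanation (our assumption here) is write-allocate caches. STREAM counts two words of traffic per Copy iteration, the load of a[i] and the store to c[i], but before the store can complete the cache first has to read the target line of c[i] from memory, so the hardware actually moves three words per iteration. The memory-controller traffic measured by likwid-perfctr is therefore 3/2 = 1.5 times the bandwidth STREAM reports for Copy.
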
< mariomulansky> if i don't pin the threads on 48-thread runs, the performance gets much worse
< heller> compare: numactl --interleave=0-7 likwid-pin -c 0-47 ./stream_omp
< heller> with: likwid-pin -c 0-47 ./stream_omp
< mariomulansky> numactl: command not found
< heller> mariomulansky: /home/heller/bin/numactl
< heller> use that one
< mariomulansky> permission denied
< heller> one sec
< heller> try again
< heller> and compare:
< heller> /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< heller> numactl --interleave=0,1 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< heller> numactl --membind=0 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
< mariomulansky> $ ls /home/heller/bin/numactl
< mariomulansky> ls: cannot access /home/heller/bin/numactl: Permission denied
< heller> look at "| Memory bandwidth [MBytes/s] | 9075.92 | 8847.24 |"
< heller> for the different cores
< heller> grmp
< heller> one sec
< heller> mariomulansky: please try again
< heller> mariomulansky: those three commands execute the same benchmark, the only difference is how the memory is placed on the different NUMA domains
< heller> topology matters :P
< mariomulansky> i see
< heller> mariomulansky: also, the perfctr likwid tool will report the correct bandwidth ;)
<@wash> one sec, will just install numactl
< heller> wash: thanks
< heller> mariomulansky: also, try the above commands with the NUMA or NUMA2 performance group (after the -g switch)
< heller> and observe
< mariomulansky> ok i see
< mariomulansky> thanks a lot, this is way more complicated than i would like it to be
< mariomulansky> i have to go now
< mariomulansky> maybe you can tell me what settings you used for Fig. 3 :)
< mariomulansky> thanks wash !
< heller> mariomulansky: didn't i already?
< heller> what is it you are missing?
< heller> i used interleaved memory binding
< heller> which places the memory in a round robin fashion
< mariomulansky> ah ok, so those are your settings
< mariomulansky> ok
< heller> where the arguments to interleave are only the numbers of the NUMA domains involved
< heller> so, for a twelve thread run (only two NUMA domains), you only do --interleave=0,1
< heller> if you do --interleave=0-7
< heller> you'll see an increase of performance
< heller> because more memory controllers are used
< mariomulansky> i see - makes sense
< heller> makes perfect sense once you've gone through the pain of looking at those performance counters with the stream benchmark ;)
< heller> but it's a good exercise
< heller> everyone in this channel should have done this at least once ;)
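
Editor's note: the interleaving heller describes above is what numactl --interleave=0,1 sets up for the whole process; the same round-robin placement can be requested per allocation through the libnuma user-level API. A minimal sketch, assuming libnuma is installed (link with -lnuma); buffer size and node string are illustrative:

```c
/* Interleave one buffer's pages round robin over NUMA domains 0 and 1,
 * roughly what numactl --interleave=0,1 does for every allocation of
 * the process.  More domains in the node string means more memory
 * controllers serving the traffic. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    size_t size = 1UL << 30;                        /* 1 GiB buffer */
    struct bitmask *nodes = numa_parse_nodestring("0,1");

    double *buf = numa_alloc_interleaved_subset(size, nodes);
    if (buf == NULL)
        return 1;

    /* ... run the bandwidth-bound kernel on buf ... */

    numa_free(buf, size);
    numa_bitmask_free(nodes);
    return 0;
}
```
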
< mariomulansky> so even if i run on one thread i can use memory from other sockets
< heller> sadly, no one except me did that :(
< heller> yes
< mariomulansky> with numactl
< heller> yup
< mariomulansky> ok
< heller> note, numactl is just a tool that uses the libnuma user level API
< mariomulansky> i will look into that further later
< mariomulansky> now i have to go
< mariomulansky> thanks a lot!
< mariomulansky> bye
< heller> you could even manually place your memory with this (documented) API
< heller> enjoy!
< mariomulansky> thanks :)
< mariomulansky> ciao
< heller> aserio: btw, you could pick that conversation up and put it into some kind of documentation form ;)
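
Editor's note: "manually place your memory with this (documented) API" refers to libnuma as well. Below is a minimal sketch (our illustration, not HPX code) of explicit per-domain placement, the alternative to interleaving: each NUMA domain gets its own block, and a NUMA-aware application keeps each worker on the block local to its socket so no traffic crosses the socket interconnect. Link with -lnuma; the block size is illustrative.

```c
/* Place one block of memory on each NUMA domain explicitly, instead of
 * interleaving a single allocation over all of them. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int domains = numa_max_node() + 1;       /* e.g. 8 on lyra */
    size_t per_domain = 1UL << 28;           /* 256 MiB per domain */
    double **block = malloc(domains * sizeof *block);

    for (int node = 0; node < domains; ++node)
        block[node] = numa_alloc_onnode(per_domain, node);

    /* ... each worker thread operates on the block attached to its own
     * socket, so its memory accesses stay within the local domain ... */

    for (int node = 0; node < domains; ++node)
        numa_free(block[node], per_domain);

    free(block);
    return 0;
}
```
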
