Improve performance of "thread-ring" benchmark #48

Open · ronawho opened this issue Dec 2, 2016 · 14 comments

ronawho commented Dec 2, 2016

thread-ring is one of the Computer Language Benchmarks Game (CLBG) codes.

Having Chapel sync vars map down to qthreads aligned_t sync vars (#22, chapel-lang/chapel#4489) dramatically improved the performance of thread-ring for us: it resulted in a 6x speedup and put us in 4th place in the actual CLBG competition.

There are Haskell, Go, and F# versions that are beating us, and I'm curious what it would take to get closer to them. Thread-ring is a "parallel" application in the sense that it creates a lot of threads/tasks, but only one task actually executes at a time, so some of the versions that are beating us do things like limit the runtime to a single real thread. I wonder if there's a better way to implement thread-ring over qthreads, or maybe scheduler/FEB optimizations, that would get us better performance.

Here's a native qthreads version based on the Chapel version we submitted:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "qthread/qthread.h"
#include <qthread/qloop.h>


// native qthreads version of thread-ring, based on Chapel's release version
//   https://github.com/chapel-lang/chapel/blob/master/test/release/examples/benchmarks/shootout/threadring.chpl

//static int n = 1000, ntasks = 503;
static int n = 50000000, ntasks = 503;
static aligned_t *mailbox;

static void passTokens(size_t start, size_t stop, void* arg) {
  uint64_t id = start;

  uint64_t numPasses = 0;
  do {
    // Block until our mailbox is full, take the token, and leave it empty.
    qthread_readFE(&numPasses, &mailbox[id]);
    // Pass the incremented token to the next task's (empty) mailbox.
    qthread_writeEF_const(&mailbox[(id+1)%ntasks], numPasses+1);
    // The task holding the token after n passes reports its (1-based) id.
    if (numPasses == n) {
      printf("%ld\n", (long)id+1);
    }
  } while (numPasses < n);
}

static void init() {
  assert(qthread_initialize() == 0);
  printf("%i threads...\n", qthread_num_workers());

  // init array of syncvars (code grabbed from stress/feb_stream.c)
  mailbox = malloc(ntasks * sizeof(aligned_t));
  for (int i=0; i<ntasks; i++) {
    mailbox[i] = 0;
    qthread_empty(&mailbox[i]);
  }
}

int main(int argc, char **argv) {
  init();

  qthread_writeEF_const(&mailbox[0], 0);
  qt_loop_simple(0, ntasks, passTokens, NULL);

  return 0;
}

It runs a second or two faster than the Chapel version, but it's pretty similar otherwise.

We're really happy with the performance we're getting now, so this is really low priority and it wouldn't bother me if it never got worked on. When I first started looking at mapping down to native qthreads sync vars I had a lot of fun playing with different implementations, so I figured this could be an interesting/fun side project or something to work on in the background that would still benefit us. Plus, I'd imagine improvements here would improve the performance of FEB operations in general.
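To give a feel for that, here is a minimal two-task ping-pong sketch that exercises the same readFE/writeEF path the benchmark above spends its time in, without the 503-task ring. It's only an illustration, not something we've measured: the names (NPASSES, ping, pong, bouncer) are made up for the example, and it assumes qthread_fork/qthread_readFF in addition to the FEB calls already used above.

#include <stdio.h>
#include <assert.h>
#include "qthread/qthread.h"

// Hypothetical FEB ping-pong sketch: two tasks bounce a counter back and
// forth through a pair of full/empty words, isolating the readFE/writeEF
// cost that dominates thread-ring.
#define NPASSES 1000000

static aligned_t ping, pong;

static aligned_t bouncer(void *arg) {
  (void)arg;
  aligned_t v = 0;
  do {
    qthread_readFE(&v, &ping);            // block until ping is full, take it, leave it empty
    qthread_writeEF_const(&pong, v + 1);  // fill pong with the bumped counter
  } while (v < NPASSES);
  return 0;
}

int main(void) {
  assert(qthread_initialize() == 0);
  qthread_empty(&ping);
  qthread_empty(&pong);

  aligned_t ret;
  qthread_fork(bouncer, NULL, &ret);      // spawn the other end of the ping-pong

  aligned_t v = 0;
  qthread_writeEF_const(&ping, 0);        // inject the initial token
  do {
    qthread_readFE(&v, &pong);
    qthread_writeEF_const(&ping, v + 1);
  } while (v < NPASSES);

  aligned_t done;
  qthread_readFF(&done, &ret);            // wait for the forked task to return
  printf("%lu passes\n", (unsigned long)v);
  return 0;
}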

ronawho mentioned this issue Dec 2, 2016
npe9 self-assigned this Dec 2, 2016
npe9 added a commit that referenced this issue Dec 5, 2016
Add a thread ring benchmark to tune performance. Begins to address issue #48.
npe9 commented Dec 6, 2016

I tested the benchmark with nemesis and distrib. I got:

nemesis
real 2m13.800s
user 140m33.444s
sys 1m52.144s

distrib
real 2m17.940s
user 144m54.988s
sys 1m53.671s

I started looking at the profiling data as well, but I don't quite understand it yet.

ronawho commented Dec 6, 2016

Those times look off to me, especially for nemesis where I would expect the real and user times to be pretty similar since the blocked qthreads shouldn't be using up much CPU.

Are you using a chapel-like build? (--enable-condwait-queue, --disable-spawn-cache, etc.)

npe9 commented Dec 6, 2016

Not yet. What's your exact configure invocation? (Or where can I find it in the Chapel source?)

m3m0ryh0l3 commented

On some systems, "user" time is multiplied by the number of cores you're using. So, on the latest-and-greatest 72-core Skylake CPU, you could easily get 144 minutes of user time out of a 2-minute wall-clock run.

ronawho commented Dec 6, 2016

@npe9 https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/Makefile has everything that we set to build qthreads. I think Dylan and/or George also added something to https://github.com/Qthreads/qthreads/blob/master/scripts/build.pl to mimic our config.

And by default we run without hyperthreads, doing something like:

nemesis:

QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_NUM_SHEPHERDS=<num physical (not logical) cpus>
QT_NUM_WORKERS_PER_SHEPHERD=1
QT_STACK_SIZE=8388608 (8MB)

distrib:

QT_STEAL_RATIO=0
QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_HWPAR=<num physical (not logical) cpus>
QT_STACK_SIZE=8388608 (8MB)

You can look at our shim layer to see exactly what we set: https://github.com/chapel-lang/chapel/blob/master/runtime/src/tasks/qthreads/tasks-qthreads.c

@m3m0ryh0l3 yeah that's true for the machines I'm running on too, but my point was that while thread-ring creates a lot of tasks, only one is active at a time, and the others are blocked on a sync var.

Here are my timings on a dual-socket machine (2 x 12-core) for nemesis:

real 14.56
user 14.74
sys 0.04

npe9 commented Dec 7, 2016

@ronawho Using your environment variable settings (and 1 shepherd to make sure NUMA isn't screwing anything up), I get the following on a 2-socket, 32-core Haswell:

sherwood
real 0m22.206s
user 0m21.843s
sys 0m0.326s

nemesis
real 0m24.182s
user 0m23.821s
sys 0m0.322s

distrib
real 0m23.903s
user 0m23.551s
sys 0m0.324s

Which is closer to what we'd expect. I'm going to dig deeper into what's going on, using perf to profile the behavior, and get back to you. Once I get this infrastructure up we can work on tuning distrib as well.

npe9 commented Dec 7, 2016

Got it to the right timing. Perf was interfering.

With distrib I'm getting 15.56s user 0.37s system 99% cpu 15.952 total

ronawho commented Dec 7, 2016

And what if you use <num physical cores> shepherds? For us, nemesis still gets good performance and the real/user times are pretty close, but for distrib we get worse perf and time -p seems to show most CPUs are busy.

npe9 commented Dec 7, 2016

Here's a diff of nemesis and distrib running thread ring on 32 cores:

[nevans@shepard-lsm1 benchmarks]$ perf diff perf.data.nemesis perf.data.distrib

Event cycles

Baseline Delta Shared Object Symbol
........ ....... .................... ..............................................

20.44%   -0.18%  libqthread.so.19.0.0  [.] qthread_gotlock_fill_inner
10.68%   +0.04%  libqthread.so.19.0.0  [.] qt_threadqueue_enqueue
 9.24%   +0.08%  libqthread.so.19.0.0  [.] qt_hash64
 7.93%   -0.06%  libqthread.so.19.0.0  [.] qthread_master
 7.07%   +0.26%  libqthread.so.19.0.0  [.] qthread_writeEF
 7.16%   -0.47%  libqthread.so.19.0.0  [.] qt_hash_lock
 5.73%   +0.01%  libqthread.so.19.0.0  [.] qt_swapctxt
 5.37%   +0.12%  libqthread.so.19.0.0  [.] qthread_readFE
 5.26%   -0.16%  libqthread.so.19.0.0  [.] qt_scheduler_get_thread
 4.58%   +0.03%  libqthread.so.19.0.0  [.] qt_hash_get_locked
 2.13%   -0.02%  libpthread-2.12.so    [.] pthread_getspecific
 1.67%   +0.20%  libqthread.so.19.0.0  [.] qt_mpool_alloc
 1.67%   +0.09%  [kernel.kallsyms]     [k] clear_page_c_e
 1.52%   +0.07%  libqthread.so.19.0.0  [.] qt_mpool_free
 1.50%   -0.04%  libqthread.so.19.0.0  [.] qt_mpool_internal_getcache
 1.07%   +0.08%  libqthread.so.19.0.0  [.] qt_setmctxt
 1.03%           libqthread.so.19.0.0  [.] qt_getmctxt
 0.95%   -0.04%  libqthread.so.19.0.0  [.] qthread_internal_self
 0.90%   -0.02%  ld-2.12.so            [.] __tls_get_addr
 0.67%   +0.01%  libqthread.so.19.0.0  [.] qthread_exec
 0.59%   +0.04%  libqthread.so.19.0.0  [.] qt_hash_unlock
 0.44%   +0.02%  libqthread.so.19.0.0  [.] qthread_writeEF_const

It looks like most of the overhead is in synchronization rather than the thread queue. Were you noticing this on thread-ring, or on other performance benchmarks in general?

npe9 commented Dec 7, 2016

Let me try this with the Chapel thread-ring proper rather than the qthreads benchmark and get back to you.

ronawho commented Dec 7, 2016

Ok, we don't have many benchmarks that use a lot of highly contended sync vars, so thread-ring performance isn't really indicative of performance for many other benchmarks.

The performance of thread-ring isn't actually that important to us, but you shouldn't let that stop you from working on it if it's fun or helping you get ramped up with Chapel.

npe9 commented Dec 7, 2016

@ronawho Exactly. I'm just getting a representative environment up. Looking at the performance results in #39, is there a particular benchmark I can spin on optimizing that models a job type you'd like to optimize? It looks like distrib and nemesis are within the noise of each other, except for MiniMD and Fasta, where distrib wins, and thread-ring and DGEMM, where nemesis is better.

In the meantime I'll work on profiling each of them.

ronawho commented Dec 7, 2016

Ah ok.

I think it's going to be tricky to compare nemesis and distrib until after #38 is resolved. Prior to using a hybrid spin/condwait approach for nemesis, distrib had faster task spawning, so I think that's why cases like MiniMD are faster for distrib.

I think it'd be easier to get the hybrid spin/condwait work done for both nemesis and distrib so that the task-spawn test performs roughly the same for each. Then I can restart nemesis vs. distrib performance testing and we can compare the results again.

ronawho commented Jan 26, 2021

This is still of mild interest to us, just because it's a somewhat interesting proxy for sync var and scheduler performance, but it's low priority.
