Improve performance of "thread-ring" benchmark #48

Open · ronawho opened this issue Dec 2, 2016 · 14 comments

ronawho commented Dec 2, 2016

thread-ring is one of the Computer Language Benchmarks Game (CLBG) codes.

Having Chapel sync vars map down to qthreads aligned_t sync vars (#22, chapel-lang/chapel#4489) dramatically improved the performance of thread-ring for us: it resulted in a 6x speedup and put us in 4th place in the actual CLBG competition.

There are Haskell, Go, and F# versions that are beating us, and I'm curious what it would take to get closer to them. Thread-ring is a "parallel" application in the sense that it creates a lot of threads/tasks, but only one task actually executes at a time, so some of the versions that are beating us do things like limit the runtime to a single real thread. I wonder if there's a better way to implement thread-ring over qthreads, or maybe scheduler/FEB optimizations, that would get us better performance.

Here's a native qthreads version based on the Chapel version we submitted:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "qthread/qthread.h"
#include <qthread/qloop.h>


// native qthreads version of thread-ring, based on Chapel's release version
//   https://github.com/chapel-lang/chapel/blob/master/test/release/examples/benchmarks/shootout/threadring.chpl

//static int n = 1000, ntasks = 503;
static int n = 50000000, ntasks = 503;
static aligned_t *mailbox;

static void passTokens(size_t start, size_t stop, void* arg) {
  uint64_t id = start;

  uint64_t numPasses = 0;
  do {
    // Block until our mailbox is full, take the token, and leave it empty.
    qthread_readFE(&numPasses, &mailbox[id]);
    // Pass the incremented token to the next task's (empty) mailbox.
    qthread_writeEF_const(&mailbox[(id+1)%ntasks], numPasses+1);
    // The task holding the token after n passes reports its (1-based) id.
    if (numPasses == n) {
      printf("%ld\n", (long)id+1);
    }
  } while (numPasses < n);
}

static void init() {
  assert(qthread_initialize() == 0);
  printf("%i threads...\n", qthread_num_workers());

  // init array of syncvars (code grabbed from stress/feb_stream.c)
  mailbox = malloc(ntasks * sizeof(aligned_t));
  for (int i=0; i<ntasks; i++) {
    mailbox[i] = 0;
    qthread_empty(&mailbox[i]);
  }
}

int main(int argc, char **argv) {
  init();

  qthread_writeEF_const(&mailbox[0], 0);
  qt_loop_simple(0, ntasks, passTokens, NULL);

  return 0;
}

It runs a second or two faster than the Chapel version, but it's pretty similar otherwise.

We're really happy with the performance we're getting now, so this is really low priority and it wouldn't bother me if it never got worked on. When I first started looking at mapping down to native qthreads sync vars I had a lot of fun playing with different implementations, so I figured this could be an interesting/fun side project or something to work on in the background that would still benefit us. Plus, I'd imagine improvements here would improve the performance of FEB operations in general.
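To give a feel for that, here is a minimal two-task ping-pong sketch that exercises the same readFE/writeEF path the benchmark above spends its time in, without the 503-task ring. It's only an illustration, not something we've measured: the names (NPASSES, ping, pong, bouncer) are made up for the example, and it assumes qthread_fork/qthread_readFF in addition to the FEB calls already used above.

#include <stdio.h>
#include <assert.h>
#include "qthread/qthread.h"

// Hypothetical FEB ping-pong sketch: two tasks bounce a counter back and
// forth through a pair of full/empty words, isolating the readFE/writeEF
// cost that dominates thread-ring.
#define NPASSES 1000000

static aligned_t ping, pong;

static aligned_t bouncer(void *arg) {
  (void)arg;
  aligned_t v = 0;
  do {
    qthread_readFE(&v, &ping);            // block until ping is full, take it, leave it empty
    qthread_writeEF_const(&pong, v + 1);  // fill pong with the bumped counter
  } while (v < NPASSES);
  return 0;
}

int main(void) {
  assert(qthread_initialize() == 0);
  qthread_empty(&ping);
  qthread_empty(&pong);

  aligned_t ret;
  qthread_fork(bouncer, NULL, &ret);      // spawn the other end of the ping-pong

  aligned_t v = 0;
  qthread_writeEF_const(&ping, 0);        // inject the initial token
  do {
    qthread_readFE(&v, &pong);
    qthread_writeEF_const(&ping, v + 1);
  } while (v < NPASSES);

  aligned_t done;
  qthread_readFF(&done, &ret);            // wait for the forked task to return
  printf("%lu passes\n", (unsigned long)v);
  return 0;
}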

ronawho mentioned this issue Dec 2, 2016
npe9 self-assigned this Dec 2, 2016
npe9 added a commit that referenced this issue Dec 5, 2016
Add a thread ring benchmark to tune performance. Begins to address issue #48.
npe9 commented Dec 6, 2016

I tested the benchmark with nemesis and distrib. I got:

nemesis
real 2m13.800s
user 140m33.444s
sys 1m52.144s

distrib
real 2m17.940s
user 144m54.988s
sys 1m53.671s

I started looking at the profiling data as well, but I don't quite understand it yet.

ronawho commented Dec 6, 2016

Those times look off to me, especially for nemesis where I would expect the real and user times to be pretty similar since the blocked qthreads shouldn't be using up much CPU.

Are you using a chapel-like build? (--enable-condwait-queue, --disable-spawn-cache, etc.)

npe9 commented Dec 6, 2016

Not yet. What's your exact configure invocation? (Or where can I find it in the Chapel source?)

m3m0ryh0l3 commented

On some systems, "user" time is multiplied by the number of cores you're using. So, on the latest-and-greatest 72-core Skylake CPU, you could easily get 144 minutes of user time out of a 2-minute wall-clock run.

ronawho commented Dec 6, 2016

@npe9 https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/Makefile has everything that we set to build qthreads. I think Dylan and/or George also added something to https://github.com/Qthreads/qthreads/blob/master/scripts/build.pl to mimic our config.

And by default we run without hyperthreads, doing something like:

nemesis:

QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_NUM_SHEPHERDS=<num physical (not logical) cpus>
QT_NUM_WORKERS_PER_SHEPHERD=1
QT_STACK_SIZE=8388608 (8MB)

distrib:

QT_STEAL_RATIO=0
QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_HWPAR=<num physical (not logical) cpus>
QT_STACK_SIZE=8388608 (8MB)

You can look at our shim layer to see exactly what we set: https://github.com/chapel-lang/chapel/blob/master/runtime/src/tasks/qthreads/tasks-qthreads.c

@m3m0ryh0l3 yeah that's true for the machines I'm running on too, but my point was that while thread-ring creates a lot of tasks, only one is active at a time, and the others are blocked on a sync var.

Here are my timings on a dual-socket machine (2 x 12-core) for nemesis:

real 14.56
user 14.74
sys 0.04

npe9 commented Dec 7, 2016

@ronawho Using your environment variable settings (and 1 shepherd to make sure NUMA isn't screwing anything up), I get the following on a 2-socket, 32-core Haswell:

sherwood
real 0m22.206s
user 0m21.843s
sys 0m0.326s

nemesis
real 0m24.182s
user 0m23.821s
sys 0m0.322s

distrib
real 0m23.903s
user 0m23.551s
sys 0m0.324s

Which is closer to what we'd expect. I'm going to dig deeper into what's going on, using perf to profile the behavior, and get back to you. Once I get this infrastructure up we can work on tuning distrib as well.

npe9 commented Dec 7, 2016

Got it to the right timing. Perf was interfering.

With distrib I'm getting 15.56s user 0.37s system 99% cpu 15.952 total

ronawho commented Dec 7, 2016

And what if you use <num physical cores> shepherds? For us, nemesis still gets good performance and the real/user times are pretty close, but for distrib we get worse perf and time -p seems to show most CPUs are busy.

npe9 commented Dec 7, 2016

Here's a diff of nemesis and distrib running thread ring on 32 cores:

[nevans@shepard-lsm1 benchmarks]$ perf diff perf.data.nemesis perf.data.distrib

Event cycles

Baseline Delta Shared Object Symbol
........ ....... .................... ..............................................

20.44%   -0.18%  libqthread.so.19.0.0  [.] qthread_gotlock_fill_inner
10.68%   +0.04%  libqthread.so.19.0.0  [.] qt_threadqueue_enqueue
 9.24%   +0.08%  libqthread.so.19.0.0  [.] qt_hash64
 7.93%   -0.06%  libqthread.so.19.0.0  [.] qthread_master
 7.07%   +0.26%  libqthread.so.19.0.0  [.] qthread_writeEF
 7.16%   -0.47%  libqthread.so.19.0.0  [.] qt_hash_lock
 5.73%   +0.01%  libqthread.so.19.0.0  [.] qt_swapctxt
 5.37%   +0.12%  libqthread.so.19.0.0  [.] qthread_readFE
 5.26%   -0.16%  libqthread.so.19.0.0  [.] qt_scheduler_get_thread
 4.58%   +0.03%  libqthread.so.19.0.0  [.] qt_hash_get_locked
 2.13%   -0.02%  libpthread-2.12.so    [.] pthread_getspecific
 1.67%   +0.20%  libqthread.so.19.0.0  [.] qt_mpool_alloc
 1.67%   +0.09%  [kernel.kallsyms]     [k] clear_page_c_e
 1.52%   +0.07%  libqthread.so.19.0.0  [.] qt_mpool_free
 1.50%   -0.04%  libqthread.so.19.0.0  [.] qt_mpool_internal_getcache
 1.07%   +0.08%  libqthread.so.19.0.0  [.] qt_setmctxt
 1.03%           libqthread.so.19.0.0  [.] qt_getmctxt
 0.95%   -0.04%  libqthread.so.19.0.0  [.] qthread_internal_self
 0.90%   -0.02%  ld-2.12.so            [.] __tls_get_addr
 0.67%   +0.01%  libqthread.so.19.0.0  [.] qthread_exec
 0.59%   +0.04%  libqthread.so.19.0.0  [.] qt_hash_unlock
 0.44%   +0.02%  libqthread.so.19.0.0  [.] qthread_writeEF_const

It looks like most of the overhead is in synchronization rather than the thread queue. Were you noticing this on thread-ring, or on other performance benchmarks in general?

npe9 commented Dec 7, 2016

Let me try this with the Chapel thread-ring proper rather than the qthreads benchmark and get back to you.

ronawho commented Dec 7, 2016

Ok, we don't have many benchmarks that use a lot of highly contended sync vars, so thread-ring performance isn't really indicative of performance for many other benchmarks.

The performance of thread-ring isn't actually that important to us, but you shouldn't let that stop you from working on it if it's fun or helping you get ramped up with Chapel.

npe9 commented Dec 7, 2016

@ronawho Exactly. I'm just getting a representative environment up. Looking at the performance results in #39, is there a particular benchmark I can spin on optimizing that models a job type you'd like to optimize? It looks like distrib and nemesis are within the noise of each other, except for MiniMD and Fasta, where distrib wins, and thread-ring and DGEMM, where nemesis is better.

In the meantime I'll work on profiling each of them.

ronawho commented Dec 7, 2016

Ah ok.

I think it's going to be tricky to compare nemesis and distrib until after #38 is resolved. Prior to using a hybrid spin/condwait approach for nemesis, distrib had faster task spawning, so I think that's why cases like MiniMD are faster for distrib.

I think it'd be easier to get the hybrid spin/condwait work done for both nemesis and distrib so that the task-spawn test performs roughly the same for each. Then I can restart nemesis vs. distrib performance testing and we can compare the results again.

ronawho commented Jan 26, 2021

This is still of mild interest to us, just because it's a somewhat interesting proxy for sync var and scheduler performance, but it's low priority.
