
Call a function on each hardware thread #52

Open
eschnett opened this issue Mar 26, 2017 · 16 comments

@eschnett

For certain low-level tasks, it is necessary to call a function exactly once on each hardware thread, sometimes even concurrently. For example, I might want to check that the hardware threads' CPU bindings are correctly set up by calling the respective hwloc function, or I might want to initialize PAPI on each thread.

(Why do I suspect problems with CPU bindings? Because I used both OpenMP and Qthreads in an application, and I didn't realize that both set the CPU binding for the main thread, but they do it differently, leading to conflicts and 50% performance loss even if OpenMP is not used, and the OpenMP threads are all sleeping on the OS. These kinds of issues are more easily debugged if one has access to certain low-level primitives in Qthreads.)

I currently have a work-around: I start many threads that busy-loop for a certain amount of time, and this often succeeds. However, a direct implementation and an official API would be convenient.

@m3m0ryh0l3

While it's not obvious or documented, you can use qt_loop() for this purpose. qt_loop() guarantees that iterations with the same index will occur on the same processing element AND that the iterations will be spread over all processing elements. Thus, qt_loop(0, qthread_num_workers()-1, func, NULL) will effectively call func once on every (non-disabled) hardware thread.

Is that good enough for your purposes?

@eschnett

Thank you, qt_loop seems to be doing exactly what I need.

eschnett reopened this Mar 28, 2017
@eschnett

Are you sure that qt_loop spreads out the work across all workers? I obtained this output:

$ env FUNHPC_NUM_NODES=1 FUNHPC_NUM_PROCS=1 FUNHPC_NUM_THREADS=8 ./hello
FunHPC: Using 1 nodes, 1 processes per node, 8 threads per process
FunHPC[0]: N0 L0 P0 (S0) T5 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T6 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T7 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T0 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]

The number after T is the hardware thread, as reported by qthread_worker(0). As you can see, several iterations ran on the same hardware thread (4), while none ran on, e.g., hardware thread 1.

@m3m0ryh0l3

m3m0ryh0l3 commented Mar 28, 2017 via email

@eschnett

What are the performance implications of using fewer shepherds? If tasks don't move between shepherds, how do idle shepherds pick up work?

@m3m0ryh0l3

m3m0ryh0l3 commented Mar 29, 2017 via email

@npe9

npe9 commented Jun 13, 2017

@eschnett was this answer sufficient?

@eschnett

qt_loop did not work for me. I am still using my original work-around, which is to start a set of threads, each of which blocks until all of the threads are running.

@npe9

npe9 commented Jun 13, 2017

@eschnett This actually dovetails into some work I'm doing here. I'll see if I can fix the problem. Can you give me sample code, along with examples of the expected and actual behavior?

@eschnett

The issue with qt_loop seems to be that it doesn't start one thread per core -- it possibly starts the same number of threads for each shepherd, but that isn't sufficient for me. I really need to start one thread per core, e.g. to set up thread affinity via hwloc. (There is some related discussion above regarding schedulers, shepherds, workers, and cores.)

As example code, I would call hwloc and output the hardware core id for each thread.

@npe9

npe9 commented Jun 13, 2017

Have you looked at the binders options at all?

@eschnett

Yes, I've looked at Qthreads' CPU binding support. The issue is that I might run multiple MPI processes per node, which means that different processes need to use different sets of cores. Setting environment variables to different values for different MPI processes is difficult.

An ideal solution would be if Qthreads had a way to pass in the node-local MPI rank and size.

@npe9

npe9 commented Jun 14, 2017

What if MPI used Qthreads?

@eschnett

@npe9 In what sense would/could MPI use Qthreads?

@npe9

npe9 commented Jun 14, 2017

Imagine if MPI's underlying threading runtime (for progress and computation threads) were actually Qthreads. Then, if you're using MPI and Qthreads together, they just "work". This space has been mined before (cf. http://dl.acm.org/citation.cfm?id=2712388). I can help you get Mpiq up and running if you want to play with it.

@ronawho

ronawho commented Sep 6, 2023

We'd also be interested in a mechanism to call something on each worker thread in order to modify some thread state.

For Arm-based Macs, we are interested in setting quality-of-service flags to limit which cores threads can run on. For more traditional configurations, we are interested in dynamically unpinning/pinning the threads to avoid interfering with other parallel runtimes (most commonly, a user wants to call out to some OpenMP-optimized library, and we want a way to get our threads out of the way).

ronawho mentioned this issue Sep 7, 2023