Currently, when a process forks, the DYAD context is not reinitialized, which can cause UCX endpoint creation errors. We need to investigate what must be reinitialized.
Current thoughts
Reinitialize the DYAD context (CTX).
Check whether UCX can be reinitialized from the forked process.
I'm pretty sure we will have to reinitialize everything. At bare minimum, we will need to reinitialize the DTL because the UCX context and worker cannot be shared across processes. I am also pretty sure that anything Flux related (e.g., the flux_t handle) will need to be reinitialized.
We will support two modes of child process creation: forking and spawning. We will not support threading until we are confident that multi-process support is robust.
For process creation, it is important to understand the various mechanisms by which a new process is created, so that we can identify ways to trigger initialization upon creation. Python's multiprocessing fork start method seems to rely on the system fork, while spawn does not. Python multiprocessing supports custom at-fork callbacks. According to Hari, PyTorch offers a similar capability itself. In some cases, we may need to intercept process creation calls and add DYAD initialization. At a minimum, I will add a call to reinitialize and define an environment variable that selects the reinitialization behavior in place of the default, under which initialization is skipped if the DYAD context object already exists. PR #63
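To make the at-fork idea concrete, here is a minimal sketch of how a child process could rebuild its per-process state right after a fork. It relies only on the standard library's `os.register_at_fork`. Everything else is an assumption for illustration: `FakeContext` stands in for the real DYAD context, `get_context` is a hypothetical accessor, and `DYAD_REINIT_ON_FORK` is a made-up variable name, not the one defined in the PR.

```python
import os

# Hypothetical variable name, used only for this sketch.
REINIT_ENV = "DYAD_REINIT_ON_FORK"


class FakeContext:
    """Stand-in for per-process state (e.g. a UCX worker or a flux_t handle)."""

    def __init__(self):
        # Record which process created this context.
        self.owner_pid = os.getpid()


_ctx = None


def get_context(force=False):
    """Return the process-wide context, creating it on first use.

    Default behavior: skip initialization if a context already exists.
    With force=True, a fresh context replaces the existing one.
    """
    global _ctx
    if _ctx is None or force:
        _ctx = FakeContext()
    return _ctx


def _after_fork_in_child():
    # Runs in the child immediately after os.fork() returns there.
    # Rebuild the inherited context only when the variable opts in.
    if _ctx is not None and os.environ.get(REINIT_ENV, "0") == "1":
        get_context(force=True)


os.register_at_fork(after_in_child=_after_fork_in_child)


def child_owns_context():
    """Fork once and report whether the child's context was created by the child."""
    get_context()  # make sure the parent has a context before forking
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compare the context's creator with our own pid.
        os.close(read_fd)
        owned = get_context().owner_pid == os.getpid()
        os.write(write_fd, b"1" if owned else b"0")
        os._exit(0)
    # Parent: collect the child's answer and reap it.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"
```

With the variable set to `1`, the child ends up with a context it created itself; with it unset, the child keeps the context inherited from the parent, which is the situation that leads to the endpoint errors. Note that the hook only fires for fork-based creation. Under the spawn start method the child is a fresh interpreter, so the ordinary first-use initialization path applies instead.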