Deadlock on destruction #334
What about the other peers in this session, are they also stuck? Or only one process? This does not ring a bell, but I'll look into it. Do you have something that reproduces this reliably (or at least frequently)?
This could be similar. I don't see any crashes. The first process gets stuck on termination, and the next one on startup.
(lldb) thread backtrace all.txt

Attaching another potential instance of this, mentioned by @ktf on Mattermost.
What I think is happening is the following:
Can we simply get rid of the asynchronous API and the thread for the Region events? Do they serve any specific purpose?
Yes, you can. If you do not call …
That will mean that I need to check every time if a new region has been added, no?
yup. |
OK, but then this is not an option either, especially if we need to process at a few hundred Hz, no? Can't we have some …
Ideally you would handle all the required events before any processing starts, right? Is this still used for GPU registration, or is there something else?
I'm not sure where we could hook this up. If it's synchronous, the question becomes: synchronous with what? We don't really have an event loop that is always alive. Maybe in the Running state, but even that is not mandatory. Any idea @dennisklein?
I'm also not sure I see the difference between …
Whether or not the registration can happen in the middle of the processing is probably better answered by @ironMann or @davidrohr. My understanding was that it could actually happen, hence the event subscription.

What I mean is that right now I process the RegionInfoEvent events between one timeframe and the next, and I only process the events which arrived since the last iteration. If I use GetRegionInfo I would need to keep track of which regions were already there and which ones are new, which is clearly suboptimal.

That said, it will still not work, AFAICT. If I understand correctly, the issue comes from multiple processes sharing a mutex in shared memory, not from the threads, no? If my understanding is correct, do we need the mutex in the first place? Can't we simply use some lock-free queue / mechanism, so that if a process dies, no lock is left held?
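For illustration, a minimal sketch of what that manual bookkeeping could look like, assuming a polling call that returns the full current region list. The `RegionInfo` struct and `GetRegionInfo()` stub below are hypothetical stand-ins, not the actual FairMQ signatures:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for the real FairMQ region API.
struct RegionInfo { uint64_t id = 0; void* ptr = nullptr; size_t size = 0; };
std::vector<RegionInfo> GetRegionInfo() { return {}; } // stub: full current list

// Called between timeframes: returns only the regions not seen before.
std::vector<RegionInfo> newRegionsSinceLastCall(std::unordered_set<uint64_t>& knownIds)
{
    std::vector<RegionInfo> fresh;
    for (auto const& info : GetRegionInfo()) {
        if (knownIds.insert(info.id).second) { // insert succeeds only for unseen ids
            fresh.push_back(info);
        }
    }
    return fresh;
}
```

Note that this only detects additions; catching destroyed regions the same way would require diffing in both directions, which is part of what makes the event subscription more convenient.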
The notion of not using a thread was brought into the discussion by yourself 😜. Neither @rbx nor I thought it was related to this issue; we just answered your questions about it. 😅 And yes, I agree with this assessment.
We need synchronisation. We probably do not need a shmem mutex as implemented right now.
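One conventional alternative, sketched here purely as a possible direction (not the actual FairMQ implementation), is a POSIX robust mutex living in the shared segment: if the owning process dies, the next locker gets EOWNERDEAD instead of deadlocking and can repair the protected state:

```cpp
#include <cerrno>
#include <pthread.h>

// Initialize a process-shared, robust mutex placed in a shmem segment.
void initRobustMutex(pthread_mutex_t* mtx)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
}

// Lock that recovers if the previous owner died while holding the mutex.
bool lockRobust(pthread_mutex_t* mtx)
{
    int rc = pthread_mutex_lock(mtx);
    if (rc == EOWNERDEAD) {
        // Previous owner died mid-critical-section: the protected state may
        // be inconsistent. Repair it here, then mark the mutex usable again.
        pthread_mutex_consistent(mtx);
        return true;
    }
    return rc == 0;
}
```

If I recall correctly, boost::interprocess (which the shmem transport builds on) does not expose robust mutexes directly, so this would mean dropping to raw pthreads for that one lock.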
Sounds like a good idea to me. Whether it is simple, I won't dare to say right now. Any hints/insights into how to do it are welcome if you already have something concrete in mind.
You still take me too seriously... ;-) That said, it's related in the sense that (I thought) the extra thread (and the mutex I use to avoid invoking callbacks during processing) makes the actual issue more likely to happen. I actually did AliceO2Group/AliceO2#7881, which should reduce the chances of the race condition happening, but apparently does not remove it completely.
See AliceO2Group/AliceO2#7881 for some ideas. Maybe you can do something very similar on your side as well? I am not sure how …
FairMQ v1.4.45 contains changes that should significantly reduce the likelihood of this issue occurring. The mutex is locked fewer times and for shorter periods. A lock-free approach is still under investigation.
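As for the lock-free direction, one possible shape (a sketch only, with a made-up event payload, not anything from FairMQ) is a fixed-capacity single-producer/single-consumer ring in shared memory; a crashed peer can at worst stop producing, but can never leave a lock behind:

```cpp
#include <atomic>
#include <cstdint>

// Fixed-capacity SPSC ring intended to live in a shared-memory segment.
// No locks: if the producer dies, the consumer simply stops seeing new
// entries instead of blocking on an orphaned mutex.
struct RegionEventRing {
    static constexpr uint32_t kCapacity = 256;          // must be a power of two
    struct Event { uint64_t regionId; uint32_t type; }; // made-up payload

    std::atomic<uint32_t> head{0}; // written only by the producer
    std::atomic<uint32_t> tail{0}; // written only by the consumer
    Event slots[kCapacity];

    bool push(Event const& e) // producer side
    {
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == kCapacity) {
            return false; // full
        }
        slots[h % kCapacity] = e;
        head.store(h + 1, std::memory_order_release); // publish the slot
        return true;
    }

    bool pop(Event& out) // consumer side
    {
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) {
            return false; // empty
        }
        out = slots[t % kCapacity];
        tail.store(t + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```

This relies on std::atomic<uint32_t> being lock-free on the target platform (true on the usual x86-64/ARM64 targets); multiple producers or consumers would need a more elaborate scheme.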
I think I see some threading-related deadlock between:
and
The other threads are doing:
Does it ring a bell? This is while shutting down a process...