Exceptions routinely raised while running with shmem #278
It is trying to open a queue that should be created by the monitor. If it keeps doing it for you, it likely means that the monitor is not (yet) started. Is that intentional in your case, or does it fail to locate it?
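For illustration only, a minimal sketch (not FairMQ's actual code; the queue name is made up) of why opening a named interprocess queue that nobody has created yet throws, which matches the "monitor not started" scenario described above:

```cpp
#include <boost/interprocess/ipc/message_queue.hpp>
#include <iostream>

int main()
{
    using namespace boost::interprocess;
    try {
        // open_only throws interprocess_exception if no queue with this
        // (hypothetical) name exists yet, e.g. because the monitor that
        // would create it has not been started
        message_queue mq(open_only, "fmq_monitor_queue_example");
    } catch (interprocess_exception& e) {
        std::cout << "queue not available yet: " << e.what() << std::endl;
    }
}
```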
I intentionally do not create the monitor, because that can leak a zombie. Is there any way to disable the heartbeat checking for it?
No, there isn't. Even if the monitor is not automatically launched, a device should not assume one will not be started at a later point, e.g. for debugging. Starting the monitor manually will not only check for present shared memory regions, but also which devices are still alive (= sending heartbeats).
I have a second problem, which I think is actually related. Now when there is a segfault in one of the devices, the others get into some weird state:
Does it also happen on Linux? I can reproduce a similar situation by SIGKILLing one device - another will hang on macOS, but not on Linux. I am still investigating whether I indeed see the same as you on macOS. I don't think it is related to the original issue, but it is something more serious. At first sight it looks to me like a problem similar to FairRootGroup/DDS@73b7209#diff-af3b638bc2a3e6c650974192a53c7291R134
Adding the workaround cure (
So indeed it looks like it is trying to acquire a mutex that has been locked by the dead process. A bit surprising that it hangs in Manager.h:349 (
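As an illustration of one way to detect a dead lock owner instead of blocking forever (this is not what FairMQ or boost::interprocess does internally; it is a Linux-oriented sketch, since robust mutexes are generally not available on macOS):

```cpp
#include <pthread.h>
#include <cerrno>
#include <cstdio>

// Initialize a process-shared, robust mutex. In practice the mutex object
// would have to be placed in the shared memory segment itself (elided here).
void init_robust(pthread_mutex_t* m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}

void lock_robust(pthread_mutex_t* m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // the previous owner died while holding the lock; the protected
        // state may be inconsistent - mark it consistent (or bail out)
        pthread_mutex_consistent(m);
        std::printf("recovered mutex from dead owner\n");
    }
}
```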
Thanks. Any progress on this?
Still using this issue because I am not sure whether it is related or not. I also see a deadlock in:
Not always, but often enough. Does it ring any bell?
It rings a bell if that also happens when a peer has crashed or been killed. There are several places where a deadlock can occur in that case. Otherwise it could be something else.
Mmm... indeed, I might be screwing up something downstream with my libuv attempts...
I've been playing with timed locks. But what is a good value for the timeout? Something could be blocking for a valid reason, and in that case it is not reasonable for me to add a timeout for it. I think if some device does crash or is killed, it is valid to let others that are waiting for it hang, at least from the perspective of those devices. The controller should detect a crash and decide how to handle it (which in the shmem case is to restart the other shm users, assuming the crashed device corrupted the contents).
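A sketch of the kind of timed lock under discussion, assuming boost::interprocess (mutex name and timeout are illustrative, not FairMQ's actual code):

```cpp
#include <boost/interprocess/sync/named_mutex.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

int main()
{
    using namespace boost::interprocess;
    named_mutex mtx(open_or_create, "example_segment_mutex");

    // try to take the lock for a bounded time instead of hanging on a
    // mutex that may be held by a dead peer
    auto deadline = boost::posix_time::microsec_clock::universal_time()
                    + boost::posix_time::seconds(5);
    if (mtx.timed_lock(deadline)) {
        // ... work on the shared segment ...
        mtx.unlock();
    } else {
        // could be a dead peer holding the lock - or a peer that is simply
        // slow, which is exactly the ambiguity raised in the comment above
        std::cout << "could not acquire lock within timeout" << std::endl;
    }
}
```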
@ktf, @rbx: TBH, I am worried about adding timeout-locks; let me explain how I see it. What happens if an OS thread within an OS process crashes (uncatchably)? All other threads within this process are killed preemptively, right? By whom? Some external control entity (e.g. the OS kernel). If we attach shared memory to two OS processes, they essentially become tightly coupled, very similar to OS threads within the same process, but with a custom external control entity. Within this analogy, if one (shmem-enabled) FairMQ device fails in an uncontrolled manner, the control entity shall detect it and preemptively kill the other FairMQ devices within the same session. So,
Using timeouts means we want to resolve the device failure cooperatively and in a distributed way (no central controller). Since we cannot just assume
@dennisklein I fully agree with your assessment; however, this is basically saying that if we use the shared memory backend, once a process dies all the devices connected to the same shared memory region should probably be killed, which currently would mean that the whole topology is killed. We could think of mitigating the issue by having non-shared-memory-based transports in a few strategic places, but the problem remains, I guess.
@dennisklein I also agree that most likely timed locks are a cure that is worse than the disease.
I guess we should look in more detail at the failing device side and see if we can handle more of the common error cases with regard to the shmem transport. SIGSEGV can have a user handler, can it not? Etc.
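A minimal sketch of installing such a SIGSEGV handler (illustrative only; note the caveat in the next comment that the process state is suspect after a segfault, so the handler should do as little as possible):

```cpp
#include <csignal>
#include <cstdlib>
#include <unistd.h>

extern "C" void segv_handler(int /*sig*/)
{
    // only async-signal-safe calls are allowed here: write() a note,
    // possibly unlink shared memory names, then terminate immediately
    const char msg[] = "SIGSEGV caught, terminating\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(EXIT_FAILURE);
}

int main()
{
    struct sigaction sa{};
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
    // ... run the device ...
}
```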
Well, frankly I would not trust anything in memory after a SIGSEGV, and I would just terminate and maybe dump a stacktrace. Anything beyond that is asking for trouble. I've seen even stacktrace dumping result in locking issues in the remote past... IMHO one thing which would improve the situation is the ability to connect "read-only" to a given shared memory region, so that a crash can only affect the devices downstream.
Declaring a device as read-only will surely give some assurance that it hasn't corrupted anything if it does crash, although I'm not sure how trustworthy that would be. To allow this, I think a flag for the controller that marks a device as read-only would be sufficient. Not sure what guarantees we can give here from the FairMQ side - once we give out the buffer pointer, it can get messed up. We could make the pointer const.

Another issue that arises when a shmem-participating device crashes is that it is likely to contain meta-data messages in its ZeroMQ queue that are lost during the crash. The meta info is lost, but the shmem is still occupied by the buffers that the meta data points to. Which means a chunk of memory is now occupied by something without a reference, and it will not get cleaned up until a full reset. In the worst case a large enough chunk is occupied that nothing can write to shmem anymore.

Two possible recovery solutions are: (1) do additional book-keeping for the in-flight messages, or (2) replace the ZeroMQ queues with interprocess queues, where the queues themselves are located in the shared memory and thus are not lost when a crash occurs. Both would have some performance drawbacks, but probably within acceptable limits. My latest test with a simple implementation of (2) showed something like 30% lower transfer rate (compared to 1 MHz, so still well within requirements, I imagine). A simple implementation of (1) would involve something like an interprocess queue in addition to zmq, so in my mind going straight for (2) is better. An implementation of (2) would also, at least at first, have a smaller feature set (e.g. only PAIR sockets) and a higher cost of multiplexing between multiple channels, because the interprocess queues from boost don't have anything like a file descriptor to integrate into asio or a different polling mechanism. But as a bonus, (2) would give better control over the queue capacity and current fill level (compared to zmq, where there are at least 4 queues with different properties for each channel), although we don't provide a FairMQ API to check the current level.

This still doesn't solve the trustworthiness of the memory after a crash, though - even if the code that caused the crash didn't have meta data pointing to another buffer, it could still have corrupted it via the segmentation violation.
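To make option (2) above concrete, here is a hedged sketch using boost::interprocess::message_queue for the meta-data messages; the queue name and the MetaHeader layout are assumptions for illustration, not the FairMQ implementation. Because the queue itself lives in shared memory, queued meta data is not lost when one peer crashes:

```cpp
#include <boost/interprocess/ipc/message_queue.hpp>
#include <cstdint>

struct MetaHeader  // illustrative meta-data record pointing into the segment
{
    std::uint64_t offset;  // offset of the buffer inside the shmem segment
    std::uint64_t size;    // buffer size in bytes
};

int main()
{
    using namespace boost::interprocess;

    // sender side: create the queue once per channel
    message_queue::remove("fmq_meta_example");
    message_queue mq(create_only, "fmq_meta_example",
                     1000 /*max msgs*/, sizeof(MetaHeader));

    MetaHeader hdr{4096, 1024};
    mq.send(&hdr, sizeof(hdr), 0 /*priority*/);

    // receiver side (normally another process): blocking receive
    MetaHeader received{};
    message_queue::size_type recvd_size = 0;
    unsigned int priority = 0;
    mq.receive(&received, sizeof(received), recvd_size, priority);
}
```

Note that, as the comment says, such a queue has no file descriptor to hand to a poller, so multiplexing several channels would need a different mechanism.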
By "read only" I meant actually protected memory pages, not simply "const". I think with mmap you could do it (e.g. map a file PROT_READ|PROT_WRITE on one side, PROT_READ on the receiving side); not sure if the shm_* stuff allows for it.
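A small sketch of that page-protection idea (file name and size are illustrative; both mappings are shown in one process here, whereas in practice they would be in different processes):

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

int main()
{
    const std::size_t size = 1 << 20;

    // writer side: read-write mapping of the shared file
    int fd_w = open("/tmp/shmem_example", O_RDWR | O_CREAT, 0600);
    ftruncate(fd_w, size);
    void* rw = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd_w, 0);

    // reader side (normally another process): read-only mapping
    int fd_r = open("/tmp/shmem_example", O_RDONLY);
    void* ro = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd_r, 0);

    // a stray write through 'ro' would fault (SIGSEGV) instead of
    // silently corrupting the shared segment; 'rw' is writable as usual
    (void)rw; (void)ro;
    munmap(rw, size); munmap(ro, size);
    close(fd_w); close(fd_r);
}
```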
Yes, boost::interprocess allows read-only access. It translates to mmap flags.
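For reference, a sketch of the boost::interprocess equivalent (segment name is made up; FairMQ's own segment management is not shown): the receiving side opens the existing segment with read_only, which boost maps to a PROT_READ mapping.

```cpp
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>

int main()
{
    using namespace boost::interprocess;

    // producer: create and size the segment read-write
    shared_memory_object writer(open_or_create, "fmq_example_segment", read_write);
    writer.truncate(1 << 20);
    mapped_region rw_region(writer, read_write);

    // consumer (normally another process): open the same segment read-only
    shared_memory_object reader(open_only, "fmq_example_segment", read_only);
    mapped_region ro_region(reader, read_only);

    // writes through ro_region.get_address() would fault
    shared_memory_object::remove("fmq_example_segment");
}
```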
Can we somehow control that from the Channel specification? DPL does know which channels are for reading and which are for writing.
No, that is not implemented atm. Neither is "read only" in general. I was just saying that it is possible. Does a per-channel flag make sense? If a device has an input (reading) channel and an output (writing) channel, the flag for the shared memory segment would have to be read_write. Also, the channels are typically instantiated after the segment. So if we add such a setting, it would be for the transport factory.
If it's not per channel, then it's not very useful. So basically you are saying I should have one transport for the inputs and another one for the outputs of my device, in order to achieve what I want?
I'm saying that if you have a device that is allowed to write to shared memory (via any one of its channels), and it crashes, then you cannot assume it didn't mess up anything, downstream or upstream. For this scenario I am only considering the case where there is a single segment for the entire topology.
I can assume that the shared pages which were read-only were not affected, no? So, given that the communication graph (at least for DPL) is a DAG, I could safely assume that devices upstream of the one which crashed were not affected.
I see that the fair::mq::shmem::Manager::SendHeartbeats method raises a number of exceptions, which are apparently caught and ignored. Is this normal?