You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I hit a crash in nmos-cpp-registry when load testing...
Config:
Windows 10, VS 2019
Boost 1.80.0
I launched 400 Nodes with nmos-multi-node on same host as the Registry, along with an NMOS Controller using the Query API WebSocket subscriptions. I thought the Registration API request processing would be throttled by the size of the thread pools for http_listener and pplx, but possibly not (needs more investigation), and ending up hitting the exclusive locker limit of boost::shared_mutex, which results in an uncaught boost::lock_error exception (and the mutex state being incorrect, so just catching the exception isn't going to help).
Before we even hit the uncaught boost::lock_error exception, in nmos::send_query_ws_events_thread, there are several non-crash errors caused by the same issue but in places where the exception is caught and logged as an "Implementation error" and causes "HTTP error: An operation was attempted on a nonexistent network connection. [windows:1229]" presumably because a registration request never got a response.
Possible workarounds for the crash itself involve replacing current boost::shared_mutex with an alternative... std::shared_mutex (but that requires C++17), std::shared_timed_mutex (that requires C++14), BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN and (probably) BOOST_THREAD_V2_SHARED_MUTEX (requires rebuild/nonstandard boost config), std::mutex (means no reader locks, so needs further testing to prove no deadlocks before even getting to performance profiling, though this is exactly what we had before the switch to use shared_mutex, and the fact that the crash is happening suggests that the profile of model reads and writes that we're seeing may not actually benefit from shared_mutex)
FWIW, this issue has presumably been present since 2018... 9bca76a
However, after experimenting on the different ways to resolve this, with limited success, I realised I'd only been trying Debug builds of the Registry... all the workarounds I tried performed perfectly in Release builds... but then so does the current shared_mutex approach. So for now, I'm just leaving this issue here as a reminder, if boost::lock_error rears its ugly head again...
I hit a crash in nmos-cpp-registry when load testing...
Config:
I launched 400 Nodes with nmos-multi-node on same host as the Registry, along with an NMOS Controller using the Query API WebSocket subscriptions. I thought the Registration API request processing would be throttled by the size of the thread pools for
http_listener
andpplx
, but possibly not (needs more investigation), and ending up hitting the exclusive locker limit ofboost::shared_mutex
, which results in an uncaughtboost::lock_error
exception (and the mutex state being incorrect, so just catching the exception isn't going to help).http://stackoverflow.com/questions/69580941/why-boostshared-mutex-cannot-block-more-than-128-threads
Before we even hit the uncaught
boost::lock_error
exception, innmos::send_query_ws_events_thread
, there are several non-crash errors caused by the same issue but in places where the exception is caught and logged as an "Implementation error" and causes "HTTP error: An operation was attempted on a nonexistent network connection. [windows:1229]" presumably because a registration request never got a response.Possible workarounds for the crash itself involve replacing current
boost::shared_mutex
with an alternative...std::shared_mutex
(but that requires C++17),std::shared_timed_mutex
(that requires C++14),BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN
and (probably)BOOST_THREAD_V2_SHARED_MUTEX
(requires rebuild/nonstandard boost config),std::mutex
(means no reader locks, so needs further testing to prove no deadlocks before even getting to performance profiling, though this is exactly what we had before the switch to useshared_mutex
, and the fact that the crash is happening suggests that the profile ofmodel
reads and writes that we're seeing may not actually benefit fromshared_mutex
)FWIW, this issue has presumably been present since 2018... 9bca76a
However, after experimenting on the different ways to resolve this, with limited success, I realised I'd only been trying Debug builds of the Registry... all the workarounds I tried performed perfectly in Release builds... but then so does the current
shared_mutex
approach. So for now, I'm just leaving this issue here as a reminder, ifboost::lock_error
rears its ugly head again...(Tangentially relevant: a useful post which explains clearly why it only makes sense to have one upgradable lock at a time... https://stackoverflow.com/questions/59710699/upgrade-shared-lock-to-unique-lock-usage-timing-and-design
And the paper which also describes it... https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3568.html#OverviewUpgrade)
The text was updated successfully, but these errors were encountered: