Reader/writer model locking #315

garethsb · 2023-01-11T12:59:03Z

I hit a crash in nmos-cpp-registry when load testing...

Config:

Windows 10, VS 2019
Boost 1.80.0

I launched 400 Nodes with nmos-multi-node on same host as the Registry, along with an NMOS Controller using the Query API WebSocket subscriptions. I thought the Registration API request processing would be throttled by the size of the thread pools for http_listener and pplx, but possibly not (needs more investigation), and ending up hitting the exclusive locker limit of boost::shared_mutex, which results in an uncaught boost::lock_error exception (and the mutex state being incorrect, so just catching the exception isn't going to help).

http://stackoverflow.com/questions/69580941/why-boostshared-mutex-cannot-block-more-than-128-threads

Before we even hit the uncaught boost::lock_error exception, in nmos::send_query_ws_events_thread, there are several non-crash errors caused by the same issue but in places where the exception is caught and logged as an "Implementation error" and causes "HTTP error: An operation was attempted on a nonexistent network connection. [windows:1229]" presumably because a registration request never got a response.

Possible workarounds for the crash itself involve replacing current boost::shared_mutex with an alternative...
std::shared_mutex (but that requires C++17), std::shared_timed_mutex (that requires C++14), BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN and (probably) BOOST_THREAD_V2_SHARED_MUTEX (requires rebuild/nonstandard boost config), std::mutex (means no reader locks, so needs further testing to prove no deadlocks before even getting to performance profiling, though this is exactly what we had before the switch to use shared_mutex, and the fact that the crash is happening suggests that the profile of model reads and writes that we're seeing may not actually benefit from shared_mutex)

FWIW, this issue has presumably been present since 2018... 9bca76a

However, after experimenting on the different ways to resolve this, with limited success, I realised I'd only been trying Debug builds of the Registry... all the workarounds I tried performed perfectly in Release builds... but then so does the current shared_mutex approach. So for now, I'm just leaving this issue here as a reminder, if boost::lock_error rears its ugly head again...

(Tangentially relevant: a useful post which explains clearly why it only makes sense to have one upgradable lock at a time... https://stackoverflow.com/questions/59710699/upgrade-shared-lock-to-unique-lock-usage-timing-and-design
And the paper which also describes it... https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3568.html#OverviewUpgrade)

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reader/writer model locking #315

Reader/writer model locking #315

garethsb commented Jan 11, 2023 •

edited

Loading

Reader/writer model locking #315

Reader/writer model locking #315

Comments

garethsb commented Jan 11, 2023 • edited Loading

garethsb commented Jan 11, 2023 •

edited

Loading