cumulus/client: added external rpc connection retry logic #5515
Conversation
Force-pushed from 51cd9b4 to 3434bb1
Force-pushed from 83b2520 to a9fa9cc
Signed-off-by: Iulian Barbu <[email protected]>
Signed-off-by: Iulian Barbu <[email protected]>
Force-pushed from 4733819 to 7e82332
Signed-off-by: Iulian Barbu <[email protected]>
Force-pushed from 7e82332 to 8e18ede
// If we reached the end of the urls list, backoff before retrying
// connections to the entire list once more.
let Ok(current_iteration) = (counter / urls.len()).try_into() else {
	tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting...");
tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting..."); | |
tracing::error!(target: LOG_TARGET, "Too many failed connection attempts to the RPC servers, aborting..."); |
hm, will this error ever be printed?
I think we need an extra check to print the error if the loop concluded without an actual connection.
hm, will this error ever be printed?
In practice it should never be printed, but if we ever see this log then something is off (either the retry logic is iterating too many times, or memory got corrupted at runtime).
I think we need an extra check to print the error if the loop concluded without an actual connection.
You mean outside of the loop, if it doesn't conclude with a connection? I added a log here: d47a965. Let me know if that's what you meant.
So when we talk about too many attempts, we're actually talking about ~2^64 of them? Does it make sense to impose such a limit? I.e., how much time would have to pass to actually hit it?
We should never hit this branch, since the loop runs for at most `urls.len() * DEFAULT_EXTERNAL_RPC_CONN_RETRIES` iterations.
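For illustration, here is a minimal sketch of the bounded loop shape described above, assuming a `DEFAULT_EXTERNAL_RPC_CONN_RETRIES` of 5 and purely illustrative logging; it is not the code merged in this PR, but it shows why the conversion of `counter / urls.len()` should never fail when the attempt counter is bounded.

```rust
// Minimal sketch of the bounded retry loop discussed above. The constant name,
// the logging, and the arithmetic are illustrative assumptions, not the merged code.
const DEFAULT_EXTERNAL_RPC_CONN_RETRIES: usize = 5;

fn demo_retry_bound(urls: &[String]) {
    assert!(!urls.is_empty(), "sketch assumes a non-empty url list");
    // `counter` counts individual connection attempts across all urls and is
    // bounded, so `counter / urls.len()` always fits into a u32 and the error
    // branch below should be unreachable in practice.
    for counter in 0..urls.len() * DEFAULT_EXTERNAL_RPC_CONN_RETRIES {
        let Ok(current_iteration) = u32::try_from(counter / urls.len()) else {
            eprintln!("Too many failed connection attempts to the RPC servers, aborting...");
            return;
        };
        let index = counter % urls.len();
        println!("attempt {counter}: iteration {current_iteration}, url index {index}");
        // ... try to connect to urls[index]; back off at the end of each full pass ...
    }
    // Falling out of the loop without a connection mirrors the error log that
    // this PR adds after the loop.
    eprintln!("Retrying to connect to any external relaychain node failed.");
}
```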
// time to catch the RPC server online and connect to it.
let conn_res = tokio::spawn(async move {
	tokio::time::timeout(
		Duration::from_secs(8),
CI tests run on quite overloaded machines. A 1s margin may not be enough, but let's see how it goes.
Pushed a simpler version of the test here: ee68d04.
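For context, a minimal sketch of the kind of test being discussed (not the actual test in ee68d04): the "server" comes online only after a delay, and the connection attempt is wrapped in a generous timeout so an overloaded CI machine still has time to catch it. It assumes tokio with the `macros`, `net`, and `time` features, and the port number is an arbitrary placeholder.

```rust
use std::time::Duration;
use tokio::net::{TcpListener, TcpStream};

#[tokio::test]
async fn connects_once_server_comes_online() {
    // Assumed-free port, used only for this illustrative test.
    let addr = "127.0.0.1:9955";

    // Bring the "server" up only after a couple of seconds.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        let listener = TcpListener::bind(addr).await.expect("bind failed");
        let _ = listener.accept().await;
    });

    // Retry the connection inside a timeout, similar in spirit to wrapping the
    // whole attempt in `tokio::time::timeout(Duration::from_secs(8), ...)`.
    let conn = tokio::time::timeout(Duration::from_secs(8), async {
        loop {
            match TcpStream::connect(addr).await {
                Ok(stream) => break stream,
                Err(_) => tokio::time::sleep(Duration::from_millis(250)).await,
            }
        }
    })
    .await;

    assert!(conn.is_ok(), "client did not connect before the timeout");
}
```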
Signed-off-by: Iulian Barbu <[email protected]>
…mal-node-retry-conn-to-external-rpc
@@ -112,6 +144,8 @@ async fn connect_next_available_rpc_server(
		Err(err) => tracing::debug!(target: LOG_TARGET, url, ?err, "Unable to connect."),
	};
}

tracing::error!(target: LOG_TARGET, "Retrying to connect to any external relaychain node failed.");
nit: should we maybe display the list of nodes?
That's individually displayed with each iteration for each url. I think we should be fine.
Assuming the `info` level is enabled, which may not be the case for the defaults. Anyway, it is minor.
Nice job!
I think this is really nice to have. We can run into situations where some subsystems try to make calls and wait for a longer time due to retries. I am not sure all places handle that super gracefully, but it still sounds a lot better than crashing.
let index = (starting_position + counter) % urls.len();
tracing::info!(
	target: LOG_TARGET,
	current_iteration,
Nit: From the user's perspective, we should probably just print `attempt` here and summarize index and iteration.
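A hedged sketch of what that nit could look like, deriving a single 1-based `attempt` value from the internal counters; the function name and output format are assumptions for illustration, not the merged code.

```rust
// Illustrative only: summarize iteration and index into one `attempt` number
// for the log line, instead of printing `current_iteration` and `index`.
fn log_attempt(counter: usize, starting_position: usize, urls: &[String]) {
    let attempt = counter + 1; // 1-based, easier to read in logs
    let index = (starting_position + counter) % urls.len();
    println!(
        "Trying to connect to next external relaychain node. attempt={attempt} url={}",
        urls[index]
    );
}
```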
Signed-off-by: Iulian Barbu <[email protected]>
…mal-node-retry-conn-to-external-rpc
LGTM! Nice job here! This should make things better 👍
The CI pipeline was cancelled due to the failure of one of the required jobs.
Once we have two reviews we need to merge fast; every time you merge master, it re-requests our review.
Yes, sorry about that. Trying to figure out how to fix it. I requested to join a few GH teams; hopefully after that's done, the next PRs will not have this issue.
4d2f793
# Description

Adds retry logic that makes the RPC relay chain interface more reliable when a collator connects to external RPC servers.

Closes #5514
Closes #4278

The final solution is still being debated on #5514, so what this PR addresses might change (e.g. #4278 might require a more advanced approach).

## Integration

Users that start collators should barely observe any difference from this logic, since the retry logic applies only when the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before the collators start, so the issue isn't visible.

## Review Notes

The added retry logic retries the connection to the RPC servers (of which there can be multiple). It lives in the cumulus/client/relay-chain-rpc-interface module and is specifically relevant to the RPC clients logic (`ClientManager`). The retry logic is not configurable: it tries to connect to the RPC servers up to 5 times, with an exponential backoff between iterations, starting with a 1 second wait time and ending with 16 seconds. The same logic is applied when an existing connection to an RPC server is dropped. There is a `ReconnectingWebsocketWorker` that ensures there is connectivity to at least one RPC node, and the retry logic makes this stronger by insisting on trying connections against the RPC servers list 5 times.

## Testing

- This was tested manually by starting zombienet natively based on [0006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml) and observing that the collators don't fail anymore:

  ```bash
  zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml
  ```

- Added a unit test that exercises the retry logic for a client connection to a server that comes online in 10 seconds. The retry logic can wait for as long as 30 seconds, but that seemed too much for a unit test. Just being conscious of CI time if it runs this test, but I am happy to see suggestions around it too. I am not sure whether it runs in CI either; I haven't figured that out entirely yet. The test can be considered an integration test too, but it exercises crate-internal implementation, not the public API.

Collators example logs after the change:

```
2024-08-29 14:28:11.730 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"
```

---------

Signed-off-by: Iulian Barbu <[email protected]>
Co-authored-by: Sebastian Kunert <[email protected]>
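For readers skimming the review notes above, here is a small sketch of the backoff schedule they describe (5 retries, exponential backoff from 1 second up to 16 seconds); the constant name and the exact formula are assumptions for illustration, not the exact implementation in this PR.

```rust
use std::time::Duration;

// Assumed retry count matching the description above.
const RETRIES: u32 = 5;

// Backoff for a given pass over the url list: 1s, 2s, 4s, 8s, 16s.
fn backoff_for(iteration: u32) -> Duration {
    Duration::from_secs(1u64 << iteration.min(4))
}

fn main() {
    for iteration in 0..RETRIES {
        println!("iteration {iteration}: wait {:?}", backoff_for(iteration));
    }
}
```

The timestamps in the example logs are consistent with this shape: the gaps between passes grow roughly as 1s, 2s, 4s, 8s.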