
cumulus/client: added external rpc connection retry logic #5515

Merged

Conversation

@iulianbarbu (Contributor) commented Aug 29, 2024

Description

Adds retry logic that makes the RPC relay chain interface more reliable when a collator connects to external RPC servers.

Closes #5514
Closes #4278

The final solution is still being debated in #5514, so what this PR addresses might change (e.g. #4278 might require a more advanced approach).

Integration

Users who start collators should barely notice a difference, since the retry logic applies only when the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before the collators start, so the issue isn't visible.

Review Notes

The added retry logic retries the connection to the RPC servers (there can be multiple). It lives in the cumulus/client/relay-chain-rpc-interface module, more specifically in the RPC client logic (ClientManager). The retry logic is not configurable: it tries to connect to the RPC servers up to 5 times, with an exponential backoff between iterations, starting with a 1 second wait and ending with 16 seconds. The same logic is applied when an existing connection to an RPC server is dropped. A ReconnectingWebsocketWorker ensures there is connectivity to at least one RPC node, and the retry logic strengthens this by insisting on trying the whole list of RPC servers 5 times.
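As a rough sketch of that flow (not the actual ClientManager code; the function names and the stubbed connection attempt below are illustrative, only the DEFAULT_EXTERNAL_RPC_CONN_RETRIES constant name is taken from the review discussion further down):

```rust
use std::time::Duration;

/// Mirrors the retry bound mentioned in the review discussion below.
const DEFAULT_EXTERNAL_RPC_CONN_RETRIES: usize = 5;

/// Illustrative stand-in for a single websocket connection attempt.
async fn try_connect(url: &str) -> Result<(), String> {
    Err(format!("could not reach {url}"))
}

/// Try every url up to DEFAULT_EXTERNAL_RPC_CONN_RETRIES times, backing off
/// exponentially (1s, 2s, 4s, 8s, 16s) after each full pass over the list.
async fn connect_with_retries(urls: &[String]) -> Option<usize> {
    let mut backoff = Duration::from_secs(1);
    for current_iteration in 0..DEFAULT_EXTERNAL_RPC_CONN_RETRIES {
        for (index, url) in urls.iter().enumerate() {
            println!("current_iteration={current_iteration} index={index} url={url}");
            if try_connect(url).await.is_ok() {
                return Some(index);
            }
        }
        // Back off before retrying connections to the entire list once more.
        tokio::time::sleep(backoff).await;
        backoff *= 2;
    }
    None
}

#[tokio::main]
async fn main() {
    let urls = vec!["ws://127.0.0.1:9944/".to_string()];
    if connect_with_retries(&urls).await.is_none() {
        eprintln!("Retrying to connect to any external relaychain node failed.");
    }
}
```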

Testing

- This was tested manually by starting zombienet natively based on cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml and observing that the collators no longer fail:

  zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

- Added a unit test that exercises the retry logic for a client connection to a server that comes online in 10 seconds (a rough sketch of that kind of test follows below). The retry logic can wait for as long as ~30 seconds, but I thought that is too much for a unit test. I am just being conscious of CI time if CI runs this test, and I am happy to hear suggestions around it. I am also not sure whether it runs in CI; I haven't figured that out entirely yet. The test could be considered an integration test too, but it exercises crate-internal implementation, not the public API.
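A rough sketch of the shape of such a test (not the actual test in this PR; the retry helper, port, and delays below are made up for illustration):

```rust
use std::time::Duration;
use tokio::net::{TcpListener, TcpStream};

/// Illustrative stand-in for the retrying connection logic under test.
async fn connect_with_retries(addr: &str) -> Result<TcpStream, ()> {
    let mut backoff = Duration::from_secs(1);
    for _ in 0..5 {
        if let Ok(stream) = TcpStream::connect(addr).await {
            return Ok(stream);
        }
        tokio::time::sleep(backoff).await;
        backoff *= 2;
    }
    Err(())
}

#[tokio::test]
async fn client_connects_to_server_that_comes_online_late() {
    // Arbitrary local port picked for the example; the real test would use
    // whatever harness the crate already provides.
    let addr = "127.0.0.1:45789";

    // Bring the "server" up only after a couple of seconds.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        let listener = TcpListener::bind(addr).await.expect("bind failed");
        let _ = listener.accept().await;
    });

    // Bound the whole test so a broken retry loop fails fast instead of hanging.
    let result =
        tokio::time::timeout(Duration::from_secs(8), connect_with_retries(addr)).await;
    assert!(matches!(result, Ok(Ok(_))), "retry logic should connect once the server is up");
}
```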

Example collator logs after the change:

2024-08-29 14:28:11.730  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"

@iulianbarbu iulianbarbu added R0-silent Changes should not be mentioned in any release notes T9-cumulus This PR/Issue is related to cumulus. labels Aug 29, 2024
@iulianbarbu iulianbarbu self-assigned this Aug 29, 2024
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 6 times, most recently from 51cd9b4 to 3434bb1 Compare August 29, 2024 13:59
@iulianbarbu iulianbarbu removed the R0-silent Changes should not be mentioned in any release notes label Aug 29, 2024
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 3 times, most recently from 83b2520 to a9fa9cc Compare August 29, 2024 14:22
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 3 times, most recently from 4733819 to 7e82332 Compare August 30, 2024 08:22
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch from 7e82332 to 8e18ede Compare August 30, 2024 09:22
@iulianbarbu iulianbarbu marked this pull request as ready for review August 30, 2024 11:31
// If we reached the end of the urls list, backoff before retrying
// connections to the entire list once more.
let Ok(current_iteration) = (counter / urls.len()).try_into() else {
tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting...");
@michalkucharczyk (Contributor) commented Aug 30, 2024:

Suggested change
-tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting...");
+tracing::error!(target: LOG_TARGET, "Too many failed connection attempts to the RPC servers, aborting...");

Contributor:

hm, will this error ever be printed?

Contributor:

I think we need an extra check to print the error if the loop concluded without an actual connection.

Contributor Author:

> hm, will this error ever be printed?

In practice it should never be printed, but if we ever see the log then something's weird (either we're iterating too many times in the retry logic or memory gets corrupted at runtime).

> I think we need an extra check to print the error if the loop concluded without an actual connection.

You mean outside of the loop, if it doesn't conclude with a connection? I added a log here: d47a965. Let me know if that's what you meant.

Contributor:

So when talking about too many attempts we are actually talking about ~2^64 too many? Does it make sense to impose such a limit? I.e., how much time would have to pass to actually hit it?

Contributor:

We should never hit this branch, since the loop runs for at most urls.len() * DEFAULT_EXTERNAL_RPC_CONN_RETRIES iterations.
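For illustration, the iteration number is just the attempt counter integer-divided by the number of urls, so the try_into can only fail for counters far beyond that bound. A minimal, hypothetical sketch (assuming a u32 target type for the example):

```rust
/// Hypothetical helper mirroring the `(counter / urls.len()).try_into()` step above.
fn current_iteration(counter: usize, num_urls: usize) -> u32 {
    // With 3 urls: counters 0..=2 map to iteration 0, 3..=5 to iteration 1, etc.
    // The conversion could only fail for astronomically large counters, which the
    // urls.len() * DEFAULT_EXTERNAL_RPC_CONN_RETRIES bound rules out.
    (counter / num_urls).try_into().expect("bounded by the retry limit")
}

fn main() {
    assert_eq!(current_iteration(2, 3), 0);
    assert_eq!(current_iteration(5, 3), 1);
    assert_eq!(current_iteration(6, 3), 2);
}
```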

// time to catch the RPC server online and connect to it.
let conn_res = tokio::spawn(async move {
tokio::time::timeout(
Duration::from_secs(8),
@michalkucharczyk (Contributor) commented Aug 30, 2024:

CI tests are run on quite overloaded machines. A 1s margin may not be enough, but let's see how it goes.

Contributor Author:

Pushed a simpler version of the test here: ee68d04.

@@ -112,6 +144,8 @@ async fn connect_next_available_rpc_server(
Err(err) => tracing::debug!(target: LOG_TARGET, url, ?err, "Unable to connect."),
};
}

tracing::error!(target: LOG_TARGET, "Retrying to connect to any external relaychain node failed.");
Contributor:

nit: should we maybe display the list of nodes?

Contributor Author:

That's displayed individually with each iteration for each url. I think we should be fine.

Contributor:

Assuming the info level is enabled, which may not be the case with the defaults. Anyway, it is minor.

@skunert (Contributor) left a comment:

Nice job!
I think this is really nice to have. We can run into situations where some subsystems try to make calls and wait for a longer time due to retries. I am not sure all places handle that super gracefully, but it still sounds a lot better than crashing.

let index = (starting_position + counter) % urls.len();
tracing::info!(
target: LOG_TARGET,
current_iteration,
Contributor:

Nit: From the user's perspective we should probably just print the attempt here and summarize index and iteration, as in the sketch below.
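A minimal sketch of what that could look like (purely illustrative, not code from this PR):

```rust
/// Purely illustrative: collapse (iteration, index) into a single 1-based
/// attempt number for friendlier log output.
fn attempt_number(current_iteration: usize, index: usize, num_urls: usize) -> usize {
    current_iteration * num_urls + index + 1
}

fn main() {
    // Third url (index 2) on the second pass (iteration 1) over 3 urls -> attempt 6.
    assert_eq!(attempt_number(1, 2, 3), 6);
}
```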

@lexnv (Contributor) left a comment:

LGTM! Nice job here! This should make things better 👍

@github-actions github-actions bot requested a review from lexnv September 2, 2024 16:04
@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: test-linux-stable 2/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7221535

@skunert (Contributor) left a comment:

Once we have two reviews we need to merge fast; every time you merge master it re-requests our review.

@iulianbarbu (Contributor Author):
> Once we have two reviews we need to merge fast; every time you merge master it re-requests our review.

Yes, sorry about that. I am trying to figure out how to fix it. I requested to join a few GH teams; hopefully, once that's done, the next PRs will not have this issue.


@skunert skunert added this pull request to the merge queue Sep 3, 2024
Merged via the queue into paritytech:master with commit 4d2f793 Sep 3, 2024
185 of 187 checks passed
@iulianbarbu iulianbarbu deleted the minimal-node-retry-conn-to-external-rpc branch September 3, 2024 19:26
x3c41a pushed a commit that referenced this pull request Sep 4, 2024