cluster_3_racks_multi_shotover_with_2_shotover_down fix intermittent failures #1813

Merged
3 commits merged into shotover:main on Nov 14, 2024

Conversation

rukai
Member

@rukai rukai commented Nov 13, 2024

This PR fixes the intermittent failures in cluster_3_racks_multi_shotover_with_2_shotover_down.

thread 'kafka_int_tests::cluster_3_racks_multi_shotover_with_2_shotover_down::case_1_java' panicked at /home/rukai/Projects/Crates/shotover/shotover-proxy/test-helpers/src/connection/kafka/mod.rs:261:13:
Consumed an unexpected record:
  expected topic "shotover_nodes_go_down_test" and it matched
  expected message "Message1" but the message was "initial"
  expected key Some("Key") and it matched
  expected offset 1 but the offset was 0

Locally I could reproduce the issue in about 1/5 test runs. I've run this PR for 40 runs and not reproduced the issue.

My understanding of the problem is:

  1. We create a consumer group and consume one record, without committing its offset.
  2. We kill 2 shotover nodes.
  3. The client's kafka driver starts hitting errors due to the suddenly killed shotover nodes.
  4. The driver recovers from these errors by restarting its processing from scratch.
  5. Since no offset was committed, the consumer resumes from offset 0 instead of offset 1.
  6. The test asserts that it consumed the record at offset 1 but instead received the record at offset 0, so it fails.

My understanding is that this driver behavior is entirely reasonable: at any time the client could crash, and any records that were consumed but not committed will be reconsumed by whichever client takes the place of the crashed one. So for the client to be sure that records will not be duplicated, it needs to commit them after any processing is done.

So the issue is purely with our test, not with shotover's implementation.
The fix is simply to have the consumer commit its offset before we kill the shotover nodes.


codspeed-hq bot commented Nov 13, 2024

CodSpeed Performance Report

Merging #1813 will not alter performance

Comparing rukai:shotover_nodes_down_flakey_test_fix (de779af) with main (c9b4cca)

Summary

✅ 38 untouched benchmarks

@rukai rukai marked this pull request as ready for review November 13, 2024 23:16
@rukai rukai enabled auto-merge (squash) November 13, 2024 23:38
@rukai rukai merged commit bab7b97 into shotover:main Nov 14, 2024
41 checks passed