cluster_3_racks_multi_shotover_with_2_shotover_down fix intermittent failures #1813

Merged
3 commits merged into shotover:main on Nov 14, 2024

Conversation

rukai
Member

@rukai rukai commented Nov 13, 2024

This PR fixes the intermittent failures in cluster_3_racks_multi_shotover_with_2_shotover_down.

thread 'kafka_int_tests::cluster_3_racks_multi_shotover_with_2_shotover_down::case_1_java' panicked at /home/rukai/Projects/Crates/shotover/shotover-proxy/test-helpers/src/connection/kafka/mod.rs:261:13:
Consumed an unexpected record:
  expected topic "shotover_nodes_go_down_test" and it matched
  expected message "Message1" but the message was "initial"
  expected key Some("Key") and it matched
  expected offset 1 but the offset was 0

Locally I could reproduce the issue in about 1/5 test runs. I've run this PR for 40 runs and not reproduced the issue.

My understanding of the problem is:

  1. We create a consumer group and consume one record, without committing its offset.
  2. We kill 2 shotover nodes.
  3. The client's kafka driver starts hitting errors due to the suddenly killed shotover nodes.
  4. The driver recovers from these errors by restarting its processing from scratch.
  5. Since no offset was committed, the consumer resumes from offset 0 instead of offset 1.
  6. The test asserts that it consumed the record at offset 1 but instead received the record at offset 0, so it fails.

My understanding is that this driver behavior is entirely reasonable: at any time the client could crash, and any records that were consumed but not committed will be reconsumed by whichever client takes the place of the crashed one. So for the client to be sure that records will not be duplicated, it needs to commit them after any processing is done.

So the issue is purely with our test, not with shotover's implementation.
The fix is simply to have the consumer commit its offset before we kill the shotover nodes.


codspeed-hq bot commented Nov 13, 2024

CodSpeed Performance Report

Merging #1813 will not alter performance

Comparing rukai:shotover_nodes_down_flakey_test_fix (de779af) with main (c9b4cca)

Summary

✅ 38 untouched benchmarks

@rukai rukai marked this pull request as ready for review November 13, 2024 23:16
@rukai rukai enabled auto-merge (squash) November 13, 2024 23:38
@rukai rukai merged commit bab7b97 into shotover:main Nov 14, 2024
41 checks passed