Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNDB-12484: Fix flaky VectorUpdateDeleteTest failures #1519

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

jkni
Copy link

@jkni jkni commented Jan 17, 2025

What is the issue

VectorUpdateDeleteTest sees flaky failures where it fails to verify the checksums of all sstables.

What does this PR fix and why was it fixed

SAITester.verifyChecksum is swapped to consider canonical sstables rather than live sstables. If the SAITester.verifier overlaps with a compaction, it could previously see a preemptively opened SSTable. Validating the SAI components for this preemptively opened sstable fails, as they haven't yet been written. The verifier should only consider canonical sstables, which are expected to have valid SAI components.

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbellis
Copy link

jbellis commented Jan 17, 2025

I'm going to scope creep you a bit here, sorry!

  • Can you add javadoc to getLiveSSTables to point out what it does and when you need to consider early open (and by implication not use getLive)?
  • Probably worth adding a getCanonicalSSTables method so getLive isn't the only "obvious" choice
  • Almost all of the usage in the test package should probably be getCanonical instead of getLive, I toyed with the idea of asking an LLM to go through them on a case by case bases but I kind of think we'll be better served by just changing them all over to getCanonical and reverting any that break

@michaeljmarshall
Copy link
Member

Probably worth adding a getCanonicalSSTables method so getLive isn't the only "obvious" choice

+1 great suggestion

@jkni
Copy link
Author

jkni commented Jan 17, 2025

Scope creep accepted. I'll update the PR shortly.

…getLiveSSTables/getCanonicalSSTables. Swap SAITester's verification to getCanonicalSSTables.
@jkni
Copy link
Author

jkni commented Jan 17, 2025

Ah, despite multiplexing this 200 times, the CI run here caught a real issue. We need a proper reference to the canonical sstables to prevent them from being removed from the filesystem during verification. I'll push a fix.

…fyChecksum. We need a ref to prevent them from getting tidied out from under the verification.
@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1519 rejected by Butler


2 new test failure(s) in 3 builds
See build details here


Found 2 new test failures

Test Explanation Branch history Upstream history
...tCorruptedSSTablesWithLeveledCompactionStrategy regression 🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ataset=uuid,wide=false,scenario=MEMTABLE_QUERY] regression 🔴🔵🔵 🔵🔵🔵🔵🔵🔵🔵

Found 14 known test failures

@jkni
Copy link
Author

jkni commented Jan 21, 2025

@JeremiahDJordan @michaeljmarshall can you re-review with the more expansive fix? The fix to use canonical sstables exposed another failure mode, which is that the verifier needs the files to exist on disk, so it wasn't enough to have the sstable reader objects. The code has now been reworked to hold a reference to the sstables, so that they can't get tidied out from under the test verifier. I've deferred the live/canonical convenience methods and test usage checking until my next backlog rotation in the follow up issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants