Fixes a bug that made concurrent access of a large nested IonStruct unsafe when only its parent had been made read-only. #722

Merged
tgregg merged 1 commit into master from fix-nested-concurrent-struct-access on Feb 16, 2024

tgregg (Contributor) commented Feb 15, 2024

Description of changes:

Releases v1.10.3 - v1.10.5 changed IonValue.clone() and IonValue.makeReadOnly() from recursive to iterative (see #557 and #549). One side effect of this change is that, on clone, IonStructs no longer eagerly copy the field maps of the cloned struct and all of its child structs. These field maps are created lazily for structs with more than 5 fields as an optimization to enable faster field access.

In #630 we identified and fixed a bug that affected concurrent access of cloned, read-only structs with more than 5 members. The struct's field map was being populated in one thread while being accessed in another, resulting in non-deterministic behavior. The fix in #630 was to force a struct's field map to be populated upon being marked read-only, making it impossible for it to be created subsequently during a period of thread contention.
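The idea behind the #630 fix can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not ion-java's IonStructLite: the names LazyStructSketch, FIELD_MAP_THRESHOLD, and hasFieldMap are assumptions made for illustration. It shows a field map that is normally built lazily on a read path, and a makeReadOnly() that builds it eagerly so that no later read has to mutate the struct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a struct with a lazy field map; NOT the ion-java implementation. */
final class LazyStructSketch {
    // Assumed constant: the description says field maps are used for structs with more than 5 fields.
    static final int FIELD_MAP_THRESHOLD = 5;

    private final List<String> names = new ArrayList<>();
    private final List<Object> values = new ArrayList<>();
    private Map<String, Integer> fieldMap; // null until built
    private boolean readOnly;

    void put(String name, Object value) {
        if (readOnly) throw new IllegalStateException("struct is read-only");
        names.add(name);
        values.add(value);
        fieldMap = null; // any previously built index is now stale
    }

    Object get(String name) {
        // Lazy build on a read path. Without the eager build in makeReadOnly(),
        // this is the kind of hidden mutation that is unsafe under concurrent reads.
        if (fieldMap == null && !readOnly && names.size() > FIELD_MAP_THRESHOLD) {
            buildFieldMap();
        }
        if (fieldMap != null) {
            Integer index = fieldMap.get(name);
            return index == null ? null : values.get(index);
        }
        // Fallback: sequential scan of the fields.
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).equals(name)) return values.get(i);
        }
        return null;
    }

    void makeReadOnly() {
        // Populate the field map before publishing the struct as read-only.
        if (fieldMap == null && names.size() > FIELD_MAP_THRESHOLD) {
            buildFieldMap();
        }
        readOnly = true;
    }

    boolean hasFieldMap() {
        return fieldMap != null;
    }

    private void buildFieldMap() {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < names.size(); i++) {
            map.putIfAbsent(names.get(i), i); // first occurrence wins, matching the scan
        }
        fieldMap = map;
    }
}
```

In this sketch, once makeReadOnly() has returned, get() never writes to the struct, which is the property the #630 fix was aiming for on the struct it was called on.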

However, this fix did not go far enough because it only populated the field map of the struct on which makeReadOnly() had been called directly, not any child structs. This meant that there was still the possibility of a race condition when accessing child structs of a parent that had been made read-only.

The added test readOnlyClonedIonStructMultithreadedNestedAccessSucceeds demonstrates this problem, consistently failing 60-80% of its trials before the fix. All other added tests succeed both before and after the fix because they exercise cases where the struct is not cloned (meaning that its field map is created as the struct is populated), and/or where the nested value that is accessed is marked read-only directly (forcing its field map to be populated due to the fix in #630).

The fix included in this PR forces field maps to be created (if applicable) for any struct marked read-only, and for all of its child structs regardless of depth. This is achieved by piggybacking on the iterative walk of the tree performed in IonValueLite.clearSymbolIDsIterative, which is already employed by IonValue.makeReadOnly. As an added protective measure, we also add a check to IonStructLite.fieldMapIsActive to skip creation of the field map if the struct has already been marked read-only. That check alone is enough to make the failing test pass, but is not a viable solution on its own for performance reasons: large nested child structs of cloned read-only structs would never have field maps created, so every field access on them would require a sequential scan of the struct's fields.
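As a rough illustration of the approach (not the actual IonValueLite.clearSymbolIDsIterative code), the sketch below walks a tree with an explicit stack and prepares every node, at any depth, before marking it read-only. NodeSketch, its fields, and the threshold constant are hypothetical names chosen for this example; the fieldMapBuilt flag stands in for populating a real field map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical tree node; NOT ion-java's IonValueLite. */
final class NodeSketch {
    // Assumed constant, mirroring the "more than 5 fields" rule from the description.
    static final int FIELD_MAP_THRESHOLD = 5;

    final List<NodeSketch> children = new ArrayList<>();
    boolean readOnly;
    boolean fieldMapBuilt; // stand-in for "the field map has been populated"

    NodeSketch add(NodeSketch child) {
        children.add(child);
        return this;
    }

    /** Marks this node and every descendant read-only without recursion. */
    void makeReadOnlyIterative() {
        Deque<NodeSketch> pending = new ArrayDeque<>();
        pending.push(this);
        while (!pending.isEmpty()) {
            NodeSketch node = pending.pop();
            // Build the lookup structure for every large node visited,
            // not only for the node on which the method was called.
            if (node.children.size() > FIELD_MAP_THRESHOLD) {
                node.fieldMapBuilt = true;
            }
            node.readOnly = true;
            for (NodeSketch child : node.children) {
                pending.push(child);
            }
        }
    }
}
```

Because the preparation happens inside a walk that already visits every node, covering nested structs adds no extra traversal in this sketch.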

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Comment on lines 304 to 295
@RepeatedTest(100)
public void readOnlyIonStructMultithreadedAccessSucceeds() {
    testReadOnlyIonStructMultithreadedAccess(false);
}
Contributor

This test is repeated 100 times. Each of those 100 times, we repeat 100 times:

  1. call makeReadOnly on the struct
  2. create 4 concurrent tasks, each of which asserts, 100 times, that field a is not null when accessed

Is this the simplest reproduction? Is @RepeatedTest necessary to provoke failure here? I understand if we don't want to go to great lengths to orchestrate the precise threading state that provokes failure, but I also wonder whether the layers of repetition could be flattened; intuitively, I suspect they're not all necessary.

This version of the test is obviously less interesting than the version that clones in step 1 above, which also makes me wonder whether we need to have two tests in such cases. Do we need a control (non-cloned) subject in CloneTest? If we do, why not handle both in the same test and do the assertions side by side?

This test was added for consistency's sake and as expected does not fail when the fix is not applied, so I'll look to readOnlyClonedIonStructMultithreadedNestedAccessSucceeds for attempted simplifications.
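For reference, the shape of the test discussed here (several concurrent tasks that each read a field repeatedly from a shared read-only subject) can be sketched without ion-java. This is an illustration only: ConcurrentReadSketch and countNullReads are hypothetical names, and a plain java.util.Map stands in for the IonStruct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper; a Map stands in for the shared read-only struct. */
final class ConcurrentReadSketch {
    /** Runs {@code tasks} concurrent readers and returns how many reads returned null. */
    static int countNullReads(Map<String, Integer> shared, String field, int tasks, int readsPerTask) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Callable<Integer>> work = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                work.add(() -> {
                    int nulls = 0;
                    for (int i = 0; i < readsPerTask; i++) {
                        if (shared.get(field) == null) nulls++;
                    }
                    return nulls;
                });
            }
            int total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Integer> result : pool.invokeAll(work)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A test built this way would assert that the returned count is zero. Whether a single pass of this kind provokes a race reliably depends on the hardware, which is the open question in this thread.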

Contributor Author

The case where clone occurs is more interesting for this particular bug, but not necessarily for all possible bugs. That's why I wanted to test them separately. It would be possible to introduce a bug that would affect only the non-cloned case; if that happens, I think it would be nice to have a test that fails separately.

Contributor

I can get behind having a test that fails separately. In that case I'd rather see the behavior (clone(), making the parent immutable) injected in some way than boolean flags.

Contributor Author

I updated the tests to inject the behavior and to remove layers / quantity of repetition while still producing reliable pre-fix failures on my system.
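One way to picture "injecting the behavior" instead of passing boolean flags is to hand the shared test body a function that prepares the subject. The sketch below is not the code in CloneTest: InjectedBehaviorSketch, AS_IS, CLONED, and readsSucceed are hypothetical names, and a List stands in for the struct.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of injected test behavior; a List stands in for the struct. */
final class InjectedBehaviorSketch {
    /** Preparation step: make the original read-only as-is. */
    static final UnaryOperator<List<String>> AS_IS =
        list -> Collections.unmodifiableList(list);

    /** Preparation step: copy first, then make the copy read-only. */
    static final UnaryOperator<List<String>> CLONED =
        list -> Collections.unmodifiableList(new ArrayList<>(list));

    /** Shared test body: the caller decides how the subject is prepared. */
    static boolean readsSucceed(List<String> original, UnaryOperator<List<String>> prepare) {
        List<String> subject = prepare.apply(original);
        return subject.contains("a");
    }
}
```

Each test then names the preparation it exercises, for example readsSucceed(values, InjectedBehaviorSketch.CLONED), which reads more clearly at the call site than a pair of positional booleans.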

testReadOnlyIonStructMultithreadedNestedAccess(false, true);
}

@RepeatedTest(100)
Contributor

Replacing all instances of @RepeatedTest(100) with @Test on my laptop still causes reliable failure (25/25 test runs) of this test with the fix rolled back, and reduces runtime of this suite from >2s to <150ms.

Contributor

Further testing shows a ~95% failure rate here, 6 passes in 100 trials with the first failure in trial 44.

Contributor Author

FWIW, this is not what I observed. I've been seeing 20-40 passes out of 100 pre-fix on my hardware. Adding the repeats is what guaranteed failure in my case.

Contributor Author

I will see if I can provoke consistent failure on my hardware while removing one of the layers of repetition.

src/test/java/com/amazon/ion/CloneTest.java (outdated, resolved)
…nsafe when only its parent had been made read-only.
@tgregg force-pushed the fix-nested-concurrent-struct-access branch from 38c10f7 to f5b4c9e on February 16, 2024 00:24
@tgregg tgregg merged commit e3617e4 into master Feb 16, 2024
21 of 33 checks passed
@tgregg tgregg deleted the fix-nested-concurrent-struct-access branch February 16, 2024 19:02
2 participants