Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds #5430

Closed
adhiamboperes opened this issue Jun 18, 2024 · 5 comments · Fixed by #5436
Closed

[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds #5430

adhiamboperes opened this issue Jun 18, 2024 · 5 comments · Fixed by #5436
Assignees
Labels
bug End user-perceivable behaviors which are not desirable. good first issue This item is good for new contributors to make their pull request. Impact: Medium Moderate perceived user impact (non-blocking bugs and general improvements). Work: Low Solution is clear and broken into good-first-issue-sized chunks.

Comments

@adhiamboperes
Copy link
Collaborator

Describe the bug

Test fails frequently on Gradle:

org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest > testLoggingInterceptor_makeNetworkCall_succeeds FAILED
    java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:99)

Steps To Reproduce

Open a PR against develop to trigger a CI run

Expected Behavior

The test should always pass.

Screenshots/Videos

No response

What device/emulator are you using?

CI

Which Android version is your device/emulator running?

No response

Which version of the Oppia Android app are you using?

No response

Additional Context

Because of the frequency of this failure, we should consider annotating it with @flaky, should a fix not be straightforward.

@adhiamboperes adhiamboperes added the bug End user-perceivable behaviors which are not desirable. label Jun 18, 2024
@adhiamboperes adhiamboperes added Impact: Medium Moderate perceived user impact (non-blocking bugs and general improvements). Work: Low Solution is clear and broken into good-first-issue-sized chunks. good first issue This item is good for new contributors to make their pull request. labels Jun 18, 2024
@BenHenning
Copy link
Member

I could be wrong, but is there actually a way to mark this test as flaky in code? Bazel does technically support a way by auto-rerunning flaky tests, but I think it'd be better just to fix the test, instead, since we don't want to encourage checking in flaky code (where we can avoid it).

Currently running 100x runs to see how flaky the test is.

@BenHenning
Copy link
Member

BenHenning commented Jun 20, 2024

With 100x runs, I saw 19 failures:

  • testLoggingInterceptor_makeNetworkCall_succeeds 14/19 times (14% failure rate)
  • testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes 5/19 times (5% failure rate)

Edit: Also, here's the command run:

bazel test --runs_per_test=100 //data:src/test/java/org/oppia/android/data/backends/gae/NetworkLoggingInterceptorTest

Unfortunately, running the tests above in isolation doesn't seem to actually catch any flakes, so these flakes are seemingly caused by state interaction between the tests of the suite.

Specific failures that we're seeing for each:

testLoggingInterceptor_makeNetworkCall_succeeds:

testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:99)

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes:

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(NetworkLoggingInterceptorTest.kt:123)

@BenHenning BenHenning self-assigned this Jun 21, 2024
@BenHenning
Copy link
Member

BenHenning commented Jun 21, 2024

So the ultimate problem for both tests seems to be this line:

    val firstRequest = firstRequestsDeferred.getCompleted().single()

This was caused indirectly by #5402 since it fixed the tests to properly verify what they're trying to verify.

Note that #5402 also made similar changes to ConsoleLoggerTest. I verified that this test, fortunately, doesn't have any flakes per the following run:

bazel test --runs_per_test=100 //utility/src/test/java/org/oppia/android/util/logging:ConsoleLoggerTest

However, this is probably a false result since, per the previous comment, it seems that multiple tests need to cooperate in order to trigger the failure.

@BenHenning
Copy link
Member

I've added a bunch of println instrumentation lines to better understand what's happening, and disabled the testLoggingInterceptor_makeCrashingNetworkCall_failsAndCompletes test just to make sure it isn't impacting the flakiness (and to reduce noise).

See https://gist.github.com/BenHenning/473e76df316acdfc99c609c363db155f for the instrumentation used for these results.

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds fails:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
E
Time: 14.058
There was 1 failure:
1) testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:108)

FAILURES!!!
Tests run: 2,  Failures: 1

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds passes:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [A] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
@@@@@@ [A] post get completed

Time: 4.972

OK (2 tests)

The diff is interesting:

image

Specific takeaways:

  • logNetworkCallFlow and logFailedNetworkCallFlow are processed in reverse order in testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes in the passing case.
  • Fetching networkLoggingInterceptor.logNetworkCallFlow in testLoggingInterceptor_makeNetworkCall_succeeds actually finishes (plus getCompleted finishes--obviously that fails for the failure case since it's the actual exception being thrown).

I'm not certain the order difference matters here, and we're otherwise not getting interesting information on why the flow isn't completing.

@BenHenning
Copy link
Member

Ah, I think I figured it out. Per https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.flow/-mutable-shared-flow.html MutableSharedFlow has no replay caching by default which makes it time sensitive to subscriptions. We can see in the above logs that emit() is actually being called before we try to consume the flow which means there's a data race between starting to observe the flow and a value actually being delivered (events delivered will only block if there's another event being delivered, not if there zero subscribers--those are simply lost).

Adding another additional synchronization barrier seems to fix the problem by ensuring flow consumption happens before emit() is called.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug End user-perceivable behaviors which are not desirable. good first issue This item is good for new contributors to make their pull request. Impact: Medium Moderate perceived user impact (non-blocking bugs and general improvements). Work: Low Solution is clear and broken into good-first-issue-sized chunks.
Development

Successfully merging a pull request may close this issue.

2 participants