[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds #5430

adhiamboperes · 2024-06-18T20:56:15Z

Describe the bug

Test fails frequently on Gradle:

org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest > testLoggingInterceptor_makeNetworkCall_succeeds FAILED
    java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:99)

Steps To Reproduce

Open a PR against develop to trigger a CI run

Expected Behavior

The test should always pass.

Screenshots/Videos

No response

What device/emulator are you using?

CI

Which Android version is your device/emulator running?

No response

Which version of the Oppia Android app are you using?

No response

Additional Context

Because of the frequency of this failure, we should consider annotating it with @flaky, should a fix not be straightforward.


BenHenning · 2024-06-20T23:44:27Z

I could be wrong, but is there actually a way to mark this test as flaky in code? Bazel does technically support a way by auto-rerunning flaky tests, but I think it'd be better just to fix the test, instead, since we don't want to encourage checking in flaky code (where we can avoid it).

Currently running 100x runs to see how flaky the test is.

BenHenning · 2024-06-20T23:50:58Z

With 100x runs, I saw 19 failures:

testLoggingInterceptor_makeNetworkCall_succeeds: 14 of the 19 failures (14% failure rate over the 100 runs)
testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes: 5 of the 19 failures (5% failure rate over the 100 runs)

Edit: Also, here's the command run:

bazel test --runs_per_test=100 //data:src/test/java/org/oppia/android/data/backends/gae/NetworkLoggingInterceptorTest

Unfortunately, running the tests above in isolation doesn't seem to actually catch any flakes, so these flakes are seemingly caused by state interaction between the tests of the suite.

Specific failures that we're seeing for each:

testLoggingInterceptor_makeNetworkCall_succeeds:

testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:99)

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes:

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(NetworkLoggingInterceptorTest.kt:123)

BenHenning · 2024-06-21T00:05:53Z

So the ultimate problem for both tests seems to be this line:

    val firstRequest = firstRequestsDeferred.getCompleted().single()

This was caused indirectly by #5402 since it fixed the tests to properly verify what they're trying to verify.
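
For reference, the exception in both traces is what Deferred.getCompleted() throws whenever the deferred hasn't finished yet, so it is a symptom of the flow collection not completing rather than the root cause. A minimal sketch of that behavior, assuming kotlinx.coroutines is on the classpath (readEarly and readAfterCompletion are hypothetical helper names for illustration, not code from the test):

```kotlin
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.ExperimentalCoroutinesApi

// Hypothetical helper: reads a deferred that nobody has completed yet.
@OptIn(ExperimentalCoroutinesApi::class)
fun readEarly(): String {
  val pending = CompletableDeferred<List<String>>() // never completed in this sketch
  return try {
    pending.getCompleted().single()
  } catch (e: IllegalStateException) {
    // Same exception type and message as in the stack traces above.
    "failed: ${e.message}"
  }
}

// Hypothetical helper: the same read succeeds once the deferred holds a value.
@OptIn(ExperimentalCoroutinesApi::class)
fun readAfterCompletion(): String {
  val done = CompletableDeferred(listOf("request")) // already completed with a value
  return done.getCompleted().single()
}

fun main() {
  println(readEarly())
  println(readAfterCompletion())
}
```

In other words, the question to answer is why the deferred collecting the flow is still pending when the test reaches this line.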

Note that #5402 also made similar changes to ConsoleLoggerTest. I verified that this test, fortunately, doesn't have any flakes per the following run:

bazel test --runs_per_test=100 //utility/src/test/java/org/oppia/android/util/logging:ConsoleLoggerTest

However, this is probably a false result since, per the previous comment, it seems that multiple tests need to cooperate in order to trigger the failure.

BenHenning · 2024-06-21T00:41:57Z

I've added a bunch of println instrumentation lines to better understand what's happening, and disabled the testLoggingInterceptor_makeCrashingNetworkCall_failsAndCompletes test just to make sure it isn't impacting the flakiness (and to reduce noise).

See https://gist.github.com/BenHenning/473e76df316acdfc99c609c363db155f for the instrumentation used for these results.

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds fails:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
E
Time: 14.058
There was 1 failure:
1) testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:108)

FAILURES!!!
Tests run: 2,  Failures: 1

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds passes:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [A] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
@@@@@@ [A] post get completed

Time: 4.972

OK (2 tests)

The diff is interesting:

Specific takeaways:

In the passing case, logNetworkCallFlow and logFailedNetworkCallFlow finish being collected in the reverse order in testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes.
In the passing case, collecting networkLoggingInterceptor.logNetworkCallFlow in testLoggingInterceptor_makeNetworkCall_succeeds actually finishes, and so does getCompleted(). In the failing case the collection never finishes, so getCompleted() is where the exception gets thrown.

I'm not certain the order difference matters here, and we're otherwise not getting interesting information on why the flow isn't completing.

BenHenning · 2024-06-21T01:19:03Z

Ah, I think I figured it out. Per https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.flow/-mutable-shared-flow.html MutableSharedFlow has no replay cache by default, which makes it sensitive to when subscriptions start. We can see in the above logs that emit() is actually being called before we try to consume the flow, which means there's a race between starting to observe the flow and a value actually being delivered (emitting only blocks if another event is still being delivered, not if there are zero subscribers; in that case the events are simply lost).
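
To illustrate the timing sensitivity, here is a small sketch assuming kotlinx.coroutines is available. collectAfterEmit is a hypothetical helper, and the replay = 1 variant is only there for contrast; it is not what the interceptor or the test uses.

```kotlin
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

// Hypothetical helper: emits one value BEFORE anyone subscribes, then tries to read it.
// Returns null if nothing arrives within the timeout.
fun collectAfterEmit(replay: Int): String? = runBlocking {
  val flow = MutableSharedFlow<String>(replay = replay)
  // With zero subscribers emit() returns right away. The value is only kept
  // if there is a replay cache to hold it.
  flow.emit("request")
  withTimeoutOrNull(200) { flow.first() }
}

fun main() {
  println(collectAfterEmit(replay = 0)) // late subscriber never sees the value
  println(collectAfterEmit(replay = 1)) // late subscriber gets the cached value
}
```

With the default replay of 0, whether the subscriber sees the value depends entirely on whether it subscribed before emit() ran, which matches the ordering difference visible in the logs.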

Adding an additional synchronization barrier seems to fix the problem by ensuring flow consumption happens before emit() is called.
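
A rough sketch of the "subscribe first, then emit" ordering, assuming kotlinx.coroutines is available. The barrier here waits on subscriptionCount, which is just one way to express it; this thread doesn't show how the actual fix builds its barrier, and collectThenEmit is a hypothetical name.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.flow.take
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

// Hypothetical helper: starts collecting, waits for the subscription to be
// registered, and only then emits. Returns null if nothing arrives in time.
fun collectThenEmit(): List<String>? = runBlocking {
  val flow = MutableSharedFlow<String>() // replay = 0, same default as discussed above
  val firstRequests = async { flow.take(1).toList() }
  // Barrier (an assumption of this sketch): suspend until the collector has subscribed.
  flow.subscriptionCount.first { it > 0 }
  flow.emit("request")
  withTimeoutOrNull(200) { firstRequests.await() }
}

fun main() {
  println(collectThenEmit())
}
```

Because the collector is guaranteed to be subscribed before emit() runs, the value can't be dropped, and the deferred is complete by the time it is read.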

adhiamboperes added the bug End user-perceivable behaviors which are not desirable. label Jun 18, 2024

adhiamboperes added this to [Team] Core Learner and Mastery flows & UI Frontend - Android Jun 18, 2024

github-project-automation bot moved this to Todo in [Team] Core Learner and Mastery flows & UI Frontend - Android Jun 18, 2024

adhiamboperes added Impact: Medium Moderate perceived user impact (non-blocking bugs and general improvements). Work: Low Solution is clear and broken into good-first-issue-sized chunks. good first issue This item is good for new contributors to make their pull request. labels Jun 18, 2024

BenHenning self-assigned this Jun 21, 2024

BenHenning mentioned this issue Jun 21, 2024

Fix #5430: Fix flakes in NetworkLoggingInterceptorTest #5436

Merged

6 tasks

BenHenning closed this as completed in #5436 Jun 21, 2024

BenHenning closed this as completed in 14976d9 Jun 21, 2024

github-project-automation bot moved this from Todo to Done in [Team] Core Learner and Mastery flows & UI Frontend - Android Jun 21, 2024
