Fix part of #1861: [RunAllTests] Update Bazel CI workflow to run all Oppia tests #1904

BenHenning · 2020-09-24T20:34:47Z

Fix part of #1861.

At a high level, this PR introduces running all of Oppia's Robolectric tests in GitHub Actions via Bazel, now that Bazel has shown sufficient maturity while being used to continually build the app binary. In particular:

1. The tests take much longer to run since Bazel's builds are more complicated than Gradle's. This has been partially mitigated by improving parallelization (e.g. running each test as its own action), among other techniques described below. However, even with these optimizations Gradle will still outperform Bazel in the long term when it comes to running tests. This is a cost we need to accept as a team.
2. One optimization done here is leveraging Bazel's remote caching functionality. However, to ensure the cache remains properly locked down, only branches created directly in this repo can use it. Consequently, PRs from contributor forks will take longer to run through the actions. Even with caching, the overall performance benefits aren't as great as expected: only about 20% of the actions are being restored from the cache at build time. This is due to poor reproducibility in the data-binding part of the Android resources build pipeline and will require further investigation & fixes to address. The remote caching functionality is locked down using git-secret. Currently only I have access to the secrets, plus the fake robot account used by the workflow runner.
3. The new actions introduce computations for affected targets. This means only the tests relevant to the underlying changes will be run, unless the run is on develop (e.g. for continuous builds) or the PR title includes [RunAllTests], in which case all tests will run. While this is a nice addition, the real benefits won't be realized for a while: the lack of modularization means small changes (e.g. to just 1 app module file) will cause all tests tied to that module to re-run, and any changes to Bazel/BUILD files will trigger a re-run of the entire test suite since the Bazel files transitively expand to the whole set of BUILD files due to a circular dependency between domain & app. Changes to WORKSPACE will always trigger a full test re-run for safety.
4. This PR introduces automatic workflow cancellation. This is even more important now that we're running so many actions at once.
5. GitHub will try to run as many actions at once as it can. I've seen times when it only runs 5 simultaneously, and other times when it runs nearly 50. This depends on current active resources across the organization, so we'll need to keep an eye on runtime to see how good/bad it is for contributors.
6. This PR also fixes a few build errors that have crept in over the last few weeks/months.
7. Per (3), this PR is running all of the tests to demonstrate that they work as expected (though given that a Bazel file was changed in this PR, all of the tests would likely have been run anyway).
8. This PR includes a few test fixes that are specific to JDK 9 environments for unknown reasons. See #1904 (comment) and the comments above it for more context.
9. A new workflow is being introduced to check that all tests passed. This is needed since GitHub's required-check settings can't automatically require every test run: the runs come from a matrix, and since the matrix can change over time, we can't rely on specifying each test as required. Instead, we will use a static job that runs after all the others as the required check.
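
The test-selection rules described above (run everything on develop, when the PR title contains [RunAllTests], or when WORKSPACE changes; otherwise run only affected targets) can be sketched roughly as follows. This is a minimal Python illustration and not the workflow's actual implementation: `ChangeContext` and `should_run_all_tests` are hypothetical names, and the affected-target computation itself is not shown.

```python
import posixpath
from dataclasses import dataclass
from typing import List

RUN_ALL_TAG = "[RunAllTests]"


@dataclass
class ChangeContext:
    """Hypothetical summary of what a CI run knows about the change."""
    branch: str               # branch the workflow is running for
    pr_title: str             # empty for non-PR (continuous) builds
    changed_files: List[str]  # repo-relative paths touched by the change


def should_run_all_tests(ctx: ChangeContext) -> bool:
    """Return True when the full suite should run instead of affected targets."""
    if ctx.branch == "develop":
        return True  # continuous builds always run everything
    if RUN_ALL_TAG in ctx.pr_title:
        return True  # explicit opt-in through the PR title
    # Any change to a WORKSPACE file forces a full run for safety.
    return any(posixpath.basename(path) == "WORKSPACE" for path in ctx.changed_files)
```

Bazel/BUILD file changes are deliberately not special-cased in this sketch: per the description, they end up re-running everything because the affected-target computation expands transitively, not because of an explicit rule.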

For some side context, #1876 required a lot of peripheral changes due to previous changes breaking Bazel test targets. This PR will help prevent that from happening. It will also help improve test reliability: it already caught at least 1 test suite (StateFragmentLocalTest) with timing issues that weren't hit with Gradle and needed to be fixed.

FYI that #1861 is tracking long-term optimizations for improving Bazel builds in CI beyond those included in this PR.

Follow-up items to complete after this PR is merged:

1-5. Verify Bazel CI test changes are working as expected #2691
6. Introduce a wiki article explaining how to set up Bazel locally to investigate CI failures (this is going to be a bit rough initially since Bazel 4.0 isn't released yet, so the custom build of Bazel will need to be used in the interim). Now available here.

Note that GitHub seems to also have a nice visualization of the job dependency graph. For specifics, see: https://github.com/oppia/oppia-android/actions/runs/424948426.

BenHenning · 2020-09-24T20:39:02Z

Note that the build will be failing until after #1876 is merged. Switching to draft for now.

BenHenning · 2020-09-24T21:14:35Z

#1876 is now merged and this PR is up-to-date so it should pass. Sending it to review.

BenHenning · 2020-09-24T23:34:01Z

Hmm, maybe this isn't feasible. CI killed this after an hour run--I wasn't expecting it to take that long.

miaboloix · 2020-09-25T17:18:54Z

> Hmm, maybe this isn't feasible. CI killed this after an hour run--I wasn't expecting it to take that long.

Maybe we should try to create various tasks instead of one large task that builds all Bazel targets? I was thinking that since we already upload the Bazel binary we could download it for use in other tasks - maybe creating tasks that build all the Bazel targets for utility, then a task for model, etc. What do you think?
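
The per-module split suggested above could start from grouping Bazel target labels by their top-level directory, so that each group becomes its own CI task. A minimal Python sketch under that assumption; `group_targets_by_module` is a hypothetical helper and the labels in the usage example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_targets_by_module(targets: List[str]) -> Dict[str, List[str]]:
    """Group Bazel labels such as //app/foo:BarTest by top-level directory."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label in targets:
        # "//app/foo:BarTest" -> package "app/foo" -> module "app"
        package = label.lstrip("/").split(":", 1)[0]
        module = package.split("/", 1)[0] or "(root)"
        groups[module].append(label)
    return dict(groups)
```

For example, `group_targets_by_module(["//utility:ATest", "//model:BTest"])` yields one group for `utility` and one for `model`, which could then be fed to separate tasks.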

BenHenning · 2020-09-27T17:02:42Z

We seem to be able to build the Android binary in ~15 minutes, which covers all of the libraries. I think there might be some base overhead to tests that is being shared across the 100 test suites, and CI isn't great at parallel builds. I think maybe building the binary & then sharing those build results with the test run might help: we can actually share all of the libraries this way, and hopefully the tests go by a bit faster. I think this means that even then, running all the tests with Bazel will exceed 50 minutes, though, which is definitely problematic.

BenHenning · 2020-11-18T21:03:01Z

So I've had some luck locally with setting up remote caching--this seems to decrease build times 2-5x which might make an optional build-all actions check viable. We'll probably want to split up the build based on modules, though, to improve parallelization.

This uses a remote storage service with a local file encrypted using git-secret to act as the authentication key for Bazel to read & write artifacts to the service for caching purposes.
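
As a rough illustration of how such a setup might be wired into a Bazel invocation, the sketch below only adds cache flags when both a cache URL and a decrypted credentials file are available (for example, not on fork PRs where secrets are unavailable). `--remote_cache` and `--google_credentials` are existing Bazel flags, but using `--google_credentials` assumes a Google Cloud backed cache, which the comment above does not state; `bazel_test_command` and its parameters are hypothetical.

```python
from typing import List, Optional


def bazel_test_command(
    targets: List[str],
    remote_cache_url: Optional[str] = None,
    credentials_path: Optional[str] = None,
) -> List[str]:
    """Build a `bazel test` argument list, enabling remote caching if possible."""
    command = ["bazel", "test"]
    # Only use the cache when both the endpoint and the key file are present;
    # otherwise fall back to a plain, uncached run.
    if remote_cache_url and credentials_path:
        command.append(f"--remote_cache={remote_cache_url}")
        # Assumption: the storage service accepts Google credentials.
        command.append(f"--google_credentials={credentials_path}")
    command.append("--")  # end of options; everything after is a target
    command.extend(targets)
    return command
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with target labels.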

BenHenning · 2021-02-09T20:44:55Z

Need to re-run everything due to Gradle flake. :(

BenHenning · 2021-02-09T20:45:52Z

@seanlip please approve codeowners change.

seanlip · 2021-02-09T21:32:10Z

Done.

@BenHenning just to check, is pubring.kbx~ (with the tilde at the end) a file you intended to check in? It looks like some sort of backup so I thought I'd double-check.

BenHenning · 2021-02-11T22:33:21Z

@seanlip as far as I can tell that file is expected to be present. I don't totally understand how git-secret manages it internally, but it doesn't gitignore that file (which I would expect if it weren't necessary, since it does gitignore others). It may be a necessary backup for the inner workings of git-secret, not sure.

Thanks for double checking!

BenHenning · 2021-02-11T23:04:52Z

Note that I'm also going to split the Bazel tests into their own workflow so that failures don't require restarting all checks (only the Bazel ones). I've also filed the post-PR tracking issues now in preparation for merging this shortly.

BenHenning · 2021-02-11T23:09:30Z

I think we also need a way to guarantee all tests are covered as we continue with the migration, otherwise it seems a bit too easy to miss things. Will follow up with something in a PR after this one.

BenHenning · 2021-02-11T23:15:02Z

I think I'm also going to move this to using Bazel 4.0 LTS since it will substantially speed up the cloning part of the workflow.

BenHenning · 2021-02-12T00:05:24Z

Awesome, green PR. :) Going to merge this in as-is & add the optimization in a follow-up PR since this unlocks being able to fix a few different things.

BenHenning · 2021-02-12T00:05:47Z

Thanks all for the feedback & reviews. This was a somewhat complicated PR, and it took quite a bit of time to get it merge-worthy.

Update Bazel CI workflow

34e00da

Build all Oppia targets in Bazel rather than just the binary target.

BenHenning requested a review from miaboloix September 24, 2020 20:34

BenHenning assigned miaboloix Sep 24, 2020

BenHenning changed the title ~~Update Bazel CI workflow~~ Update Bazel CI workflow to build all Oppia targets Sep 24, 2020

Update main.yml

3f0cd72

Reset the workflow name so that it doesn't need to be renamed in GitHub settings.

BenHenning marked this pull request as draft September 24, 2020 20:39

Merge branch 'develop' into build-all-targets-in-bazel-ci

a94e160

BenHenning marked this pull request as ready for review September 24, 2020 21:14

BenHenning marked this pull request as draft September 24, 2020 23:34

miaboloix assigned BenHenning and unassigned miaboloix Sep 25, 2020

BenHenning mentioned this pull request Sep 29, 2020

Shift NumericInput Rules test to separate sub-package #1882 (Closed)

Merge branch 'develop' into build-all-targets-in-bazel-ci

ec990f6

BenHenning added 11 commits November 18, 2020 13:56

Introduce remote caching in Bazel.

5c92f2c

This uses a remote storage service with a local file encrypted using git-secret to act as the authentication key for Bazel to read & write artifacts to the service for caching purposes.

Add debug line.

9d026e9

Disable workflows + fix debug line.

4ac50c6

More debugging.

93c3608

More debugging.

4df788f

Work around GitHub hiding secret since we're debugging.

b292826

Use base64 to properly encode newlines in GPG keys.

5bbc72a

Remove debug lines before changing back to correct GPG key.

c9b9d0b

Switch to production key.

ca3fadc

Fix env variable reference. Lock-down actions workflows via codeowners.

7854318

Install git-secret to default location.

74a775b
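
The "Use base64 to properly encode newlines in GPG keys" commit above suggests that the multi-line key was encoded into a single line before being stored as a CI secret, then decoded at run time. A minimal Python sketch of that round trip, assuming an ASCII-armored key; the function names are hypothetical and the actual workflow may do this differently (e.g. in shell).

```python
import base64


def encode_key_for_secret(armored_key: str) -> str:
    """Encode a multi-line key as one base64 line suitable for a secret value."""
    return base64.b64encode(armored_key.encode("utf-8")).decode("ascii")


def decode_key_from_secret(secret_value: str) -> str:
    """Recover the original multi-line key, newlines included."""
    return base64.b64decode(secret_value).decode("utf-8")
```

Because the encoded value contains no newlines, it survives being passed through environment variables unchanged.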

rt4914 assigned BenHenning and unassigned rt4914 Feb 9, 2021

BenHenning requested review from a team and seanlip and removed request for a team February 9, 2021 20:45

BenHenning assigned seanlip Feb 9, 2021

seanlip approved these changes Feb 9, 2021

seanlip removed their assignment Feb 9, 2021

Merge branch 'develop' into build-all-targets-in-bazel-ci

3312174

Conflicts: .github/workflows/main.yml domain/BUILD.bazel testing/BUILD.bazel

This was referenced Feb 11, 2021

Verify Bazel CI test changes are working as expected #2691 (Closed)

Investigate sharing shared workflow functionality across workflows/jobs in GitHub Actions #2692 (Open)

BenHenning added 2 commits February 11, 2021 15:09

Post-merge fixes.

b44ae3a

This fixes some tests that were broken after recent PRs, and fixed a visibility error introduced in #2663.

Move Bazel tests to new workflow.

1a3f4e9

This will make it easier to restart failures without having to also restart unrelated tests.

BenHenning merged commit 9173c96 into develop Feb 12, 2021

BenHenning deleted the build-all-targets-in-bazel-ci branch February 12, 2021 00:06

BenHenning restored the build-all-targets-in-bazel-ci branch October 25, 2021 21:39

BenHenning deleted the build-all-targets-in-bazel-ci branch October 25, 2021 21:51
