Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix part of #1861: [RunAllTests] Update Bazel CI workflow to run all Oppia tests #1904

Merged
merged 77 commits into from
Feb 12, 2021

Conversation

BenHenning
Copy link
Member

@BenHenning BenHenning commented Sep 24, 2020

Fix part of #1861.

At a high-level, this PR introduces running all of Oppia's Robolectric tests in GitHub Actions via Bazel now that Bazel has shown sufficient maturity while being used to continually build the app binary. In particular:

  1. The tests take much longer to run since Bazel's builds are more complicated than on Gradle. This has been partially mitigated by improving parallelization (e.g. running each test as its own action), among other techniques described below. However, even with these optimizations Gradle will still outperform Bazel in the long-term when it comes to running tests. This is a cost we need to accept as a team.
  2. One optimization done here is leveraging Bazel's remote caching functionality. However, to ensure the cache remains properly locked down, only branches created directly in this repo can leverage caching functionality. Consequently, forks from contributors will take longer to run through the actions. Despite caching, the overall performance benefits aren't as great as expected: only about 20% of the actions are being restored during build time. This is due to poor reproducibility in the data-binding part of the Android resources build pipeline and will require further investigation & fixes to address. The remote caching functionality is locked down using git-secret. Currently only I have access to the secrets, plus the fake robot account used by the workflow runner.
  3. The new actions actually introduce computations for affected targets. This means only the tests relevant to the underlying changes will be run, except if on develop (e.g. for continuous builds) in which case all tests will run, or if the PR title includes [RunAllTests]. While this is a nice addition, the real benefits won't be realized for a while: the lack of modularization means small changes (e.g. to just 1 app module file) will result in all tests tied to that module to need to re-run, and any changes to Bazel/BUILD files will trigger a re-run of the entire test suite since the Bazel files transitively expand to the whole set of BUILD files due to a circular dependency between domain & app. Changes to WORKSPACE will always trigger a full test re-run for safety.
  4. This PR introduces automatic workflow cancellation. This is even more important now that we're running so many actions at once.
  5. GitHub will try to run as many actions at once as it can. I've seen times when it only runs 5 simultaneously, and other times where it runs nearly 50. This depends on current active resources across the organization, so we'll need to keep an eye on runtime to see how good/bad it is for contributors.
  6. This PR also fixes a few build errors that have crept up in the last few weeks/months.
  7. Per (3) this PR is running all of the tests to demonstrate that they work as expected (though given that a Bazel file was changed in this PR, it would be likely that all of the tests would have been run, anyway).
  8. This PR includes a few test fixes that are specific to JDK 9 environments for unknown reasons. See Fix part of #1861: [RunAllTests] Update Bazel CI workflow to run all Oppia tests #1904 (comment) and comments above it for more context.
  9. A new workflow is being introduced to check that all tests passed. This is needed since the required checks in GitHub's settings can't require all test runs to be automatically required since it uses a matrix. Since this can change over time, we can't rely on specifying each test as required. Instead, we will use a static job that runs after all the others as the required check.

For some side context, #1876 required a lot of peripheral changes due to previous changes breaking Bazel test targets. This will help prevent that from happening. This PR will also help improve test reliability since it caught at least 1 test suite that had some timing issues that needed to be fixed that wasn't hit with Gradle (StateFragmentLocalTest).

FYI that #1861 is tracking long-term optimizations for improving Bazel builds in CI beyond those included in this PR.

Follow-up items to complete after this PR is merged:

  • 1-5. Verify Bazel CI test changes are working as expected #2691
  • 6. Introduce a wiki article explaining how to setup Bazel locally to investigate CI failures (this is going to be a bit rough initially since Bazel 4.0 isn't released yet, so the custom build of Bazel will need to be used in the interim). Now available here.

Note that GitHub seems to also have a nice visualization of the job dependency graph. For specifics, see: https://github.com/oppia/oppia-android/actions/runs/424948426.

Build all Oppia targets in Bazel rather than just the binary target.
@BenHenning BenHenning changed the title Update Bazel CI workflow Update Bazel CI workflow to build all Oppia targets Sep 24, 2020
Reset the workflow name so that it doesn't need to be renamed in GitHub settings.
@BenHenning
Copy link
Member Author

Note that the build will be failing until after #1876 is merged. Switching to draft for now.

@BenHenning BenHenning marked this pull request as draft September 24, 2020 20:39
@BenHenning
Copy link
Member Author

#1876 is now merged and this PR is up-to-date so it should pass. Sending it to review.

@BenHenning BenHenning marked this pull request as ready for review September 24, 2020 21:14
@BenHenning
Copy link
Member Author

Hmm, maybe this isn't feasible. CI killed this after an hour run--I wasn't expecting it to take that long.

@BenHenning BenHenning marked this pull request as draft September 24, 2020 23:34
@miaboloix
Copy link
Contributor

miaboloix commented Sep 25, 2020

Hmm, maybe this isn't feasible. CI killed this after an hour run--I wasn't expecting it to take that long.

Maybe we should try to create various tasks instead of one large task that builds all Bazel targets? I was thinking that since we already upload the Bazel binary we could download it for use in other tasks - maybe creating tasks that build all the Bazel targets for utility, then a task for model, etc. What do you think?

@miaboloix miaboloix assigned BenHenning and unassigned miaboloix Sep 25, 2020
@BenHenning
Copy link
Member Author

We seem to be able to build the Android binary in ~15 minutes which covers all of the libraries. I think there might be some base overhead to tests that are being shared across the 100 test suites, and CI isn't great at parallel builds. I think maybe building the binary & then sharing those build results with the test run might help: we can actually share all of the libraries this way, and hopefully the tests go by a bit faster. I think this means that even running all the tests with Bazel will exceeds 50 minutes, though, which is definitely problematic.

@BenHenning
Copy link
Member Author

So I've had some luck locally with setting up remote caching--this seems to decrease build times 2-5x which might make an optional build-all actions check viable. We'll probably want to split up the build based on modules, though, to improve parallelization.

@rt4914 rt4914 assigned BenHenning and unassigned rt4914 Feb 9, 2021
@BenHenning
Copy link
Member Author

Need to re-run everything due to Gradle flake. :(

@BenHenning BenHenning requested review from a team and seanlip and removed request for a team February 9, 2021 20:45
@BenHenning
Copy link
Member Author

@seanlip please approve codeowners change.

@seanlip
Copy link
Member

seanlip commented Feb 9, 2021

Done.

@BenHenning just to check, is pubring.kbx~ (with the tilde at the end) a file you intended to check in? It looks like some sort of backup so I thought I'd double-check.

@seanlip seanlip removed their assignment Feb 9, 2021
@BenHenning
Copy link
Member Author

BenHenning commented Feb 11, 2021

@seanlip so much as I can tell that file is expected to be present. I don't totally understand how git secret manages it internally, but it doesn't gitignore that file (which I would expect if it weren't necessary since it does gitignore others). It may be a necessary backup for the inner workings of git secret, not sure.

Thanks for double checking!

Conflicts:
	.github/workflows/main.yml
	domain/BUILD.bazel
	testing/BUILD.bazel
@BenHenning
Copy link
Member Author

BenHenning commented Feb 11, 2021

Note that I'm going to also split the Bazel tests into their own workflow so that failures don't require restarting all checks (only the Bazel ones). I've also filed the post-PR tracking issues now in preparation of merging this shortly.

@BenHenning
Copy link
Member Author

I think we also need a way to guarantee all tests are covered as we continue with the migration, otherwise it seems a bit too easy to miss things. Will follow up with something in a PR after this one.

This fixes some tests that were broken after recent PRs, and fixed a
visibility error introduced in #2663.
This will make it easier to restart failures without having to also
restart unrelated tests.
@BenHenning
Copy link
Member Author

I think I'm also going to move this to using Bazel 4.0 LTS since it will substantially speed up the cloning part of the workflow.

@BenHenning
Copy link
Member Author

Awesome, green PR. :) Going to merge this in as-is & add the optimization in a follow-up PR since this unlocks being able to fix a few different things.

@BenHenning
Copy link
Member Author

BenHenning commented Feb 12, 2021

Thanks all for the feedback & reviews. This was a bit more complicated of a PR, and took quite a bit of time to get it merge-worthy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants