[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588

reta · 2024-03-28T16:36:05Z

dblock · 2024-03-28T17:19:20Z

I support this proposal. As far as I know, manual sign off caused delay in releasing 5+ minor versions (as additional time was required for manual sign off), and resulted in 7 patch releases (as bugs slipped through manual sign off) in 2023.

dblock · 2024-04-01T20:32:09Z

Here's a link to what's being manually signed off in 2.13: #4433 (comment)

Pallavi-AWS · 2024-04-01T21:24:57Z

I am supportive - thanks @reta for pursuing this. However, given the number of manual signoffs/flaky tests, can we agree on a % reduction goal for 2.14, followed by targets for 2.15 and beyond such that we can get down to 0 manual signoffs in 3-4 months. All teams are impacted and are burning down, however, the pace is slower than we expect.

bbarani · 2024-04-01T22:00:23Z

The below components have most number of flaky tests. If we are taking the % reduction goal, we should take it at component level.

OpenSearch:

Reporting
SQL
Neural search
ml commons
Index management
alerting
CCR
Notifications
security-analytics

OpenSearch Dashboards:

Dashboards core
security analytics dashboards
IM Dashboards
AD dashboards
reports dashboards
Dashboard reporting
ganttchart dashboards
observability dashboards
alerting dashboards

Pallavi-AWS · 2024-04-02T10:52:15Z

Thanks @bbarani - can we take a goal of 50% reduction for each component by 2.14? It will be difficult to manage different reduction targets across different components.

dblock · 2024-04-02T15:33:40Z

Thanks @bbarani - can we take a goal of 50% reduction for each component by 2.14? It will be difficult to manage different reduction targets across different components.

Why 50% and not 100%? In 2.14 there were only 4 plugins that didn't manually sig-off. Any team that doesn't have stable tests should be focused on that and not adding new features.

Pallavi-AWS · 2024-04-02T15:44:58Z

Caught up with @bbarani, @dblock and @andrross. Let's prioritize eliminating the manual signoffs in 2.14 - we do not ship 2.14 unless we get green builds across core and repos.

gaiksaya · 2024-04-03T21:13:18Z

I have created child issues for the components that manually signed off in 2.13.0. See above links.
Will tag the maintainers after a week if there is no response on any of them.

Thanks!

vikasvb90 · 2024-04-09T02:32:19Z

@bbarani You mentioned ISM as one of the components having high number of flaky tests. This isn't true and may not be true for some of the other plugins as well. As I called out in release meeting as well, for ISM primary reason of failures was update of wrong security certificates in the repo. This update in security certificate was suggested to plugins by security earlier to increase the expiry and fixed later by another PR opensearch-project/index-management#1147 by security itself.
But I agree that there are still some flaky tests which fail on certain platforms intermittently but I haven't seen them blocking the release.
@gaiksaya Can I request you to change status of ISM release from manual sign-off to fixed tests so that it gives the right picture because we did not provide any manual sign-off? I am assuming that you referred to the passed builds of PR instead of re-executing jenkins for the sign-off.

gaiksaya · 2024-04-09T04:10:02Z

Hi @vikasvb90 ,

Even before security changes were merged, index management plugin has seen lot of flaky tests in the past. See the issues opened and closed multiple times in the past releases here. The re-executing of jenkins would not have fixed it as a new RC would have to be generated as integration test framework checks out the exact commit that was used to build RC.

We also recently matched the GitHub Actions specifications on Jenkins. See opensearch-project/opensearch-ci#412
We can use 2.14.0 release as testing environment for this. Would that be okay?

vikasvb90 · 2024-04-09T06:21:45Z

@gaiksaya I didn't mean that there are no flaky tests in ISM, I have mentioned this in my comment as well. We do have some flaky tests and we have been making improvements as well. I am pointing out the reason of failures in the last release based on which this was stated most number of flaky tests
So, to re-iterate, 2.13 got blocked specifically because of security and ISM had nothing to do with it.

bbarani · 2024-04-09T14:45:08Z

@vikasvb90, thank for your comment. This issue mainly focuses on remediating all flaky tests and doesn't focus on specific released OpenSearch version. The goal is to identify flaky tests and fix it before 2.14.0 release.

ashwin-pc · 2024-05-13T16:18:06Z

OpenSearch Dashboards did not pass the build CI for 2.14 even though it passed these tests locally and it it's own component CI. I will work with the build team to address it.

Pallavi-AWS · 2024-05-13T16:26:29Z

Thanks @ashwin-pc for getting the Dashboards testcases to run locally - we need all testcases running as part of CI in 2.15 without exceptions.

ashwin-pc · 2024-06-08T01:35:15Z

Update on passing the tests in the Build CI to avoid a manual signoff for 2.15:

To avoid having manual signoffs for OSD in 2.15, we need to split the OSD tests into groups. In 2.14 we had tests pass when run individually but fail when all the tests are run together. It appears that the issue doesn't lie with individual tests for OpenSearch Dashboards, but rather when running all the tests together. The OpenSearch Dashboards Github CI does not face this issue because we use CI groups to run the tests. The Functional Test Repo team has also observed that splitting the tests into separate CI groups helped resolve the failures.

We need to implement a similar approach for the Build CI as well. Instead of running all the tests together, we should partition them into smaller groups or batches. This way, if any test leaves behind residual data or fails to clean up properly, it won't affect the subsequent tests in the same batch.

While this may not address the root cause of the flakiness, it aligns with the previously successful solution of partitioning tests. Historically we have learned that if OSD tests run sequentially, we should expect failures. Investigating and fixing the underlying issues would require significant research efforts, which may not be feasible given the large number of inherited tests.

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

ashwin-pc · 2024-06-08T01:38:55Z

One alternative to this is to use @AMoo-Miki 's pipeline that has already implemented this grouping, that can use a built artifact and run the tests in groups.

reta · 2024-06-10T14:36:44Z

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think as a workaround, grouping looks like a step in right direction but one may still face the flakyness in scope of the single group since the cause (why sequential test runs are flaky) is still not understood.

rishabh6788 · 2024-06-10T20:55:49Z

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think as a workaround, grouping looks like a step in right direction but one may still face the flakyness in scope of the single group since the cause (why sequential test runs are flaky) is still not understood.

Another concern that I have is that the possibility of detecting a valid regression reduces when running in groups. For e.g., if a feature that causes performance regression is put in one group and by any chance is the last test of the particular group then subsequent groups will not detect that issue. As of now if that happens, the subsequent tests fail with timeout or other related errors.

rishabh6788 · 2024-06-10T21:00:46Z

I believe @SuZhou-Joe is looking into flakey test failures and has made some discoveries about their probable cause.
It will be helpful if you could add your findings on this issue.
cc: @peterzhuamazon

peterzhuamazon · 2024-06-10T21:15:44Z

I believe @SuZhou-Joe is looking into flakey test failures and has made some discoveries about their probable cause. It will be helpful if you could add your findings on this issue. cc: @peterzhuamazon

Thanks @rishabh6788 .

Also waiting for @SuZhou-Joe to provide a possible fix to the root problem.
This should give us the confidence we need to proceed.

The initial changes on github actions seems like avoiding the issues rather than fixing it permanently:

[Infra] Add CI groups for OSD core test cases to avoid flaky test opensearch-dashboards-functional-test#1352

Thanks.

ashwin-pc · 2024-06-12T04:58:13Z

I think as a workaround, grouping looks like a step in right direction but one may still face the flakyness in scope of the single group since the cause (why sequential test runs are flaky) is still not understood.

@reta Flakeyness within a single group is manageable since tests wont interfere with each other as much and all that the author of the test needs to ensure is that the group runs successfully. And because they are smaller groups, validating fixes to a group should be a lot quicker and more reproducible. Every fix made to the flakey tests so far has been addressing tests interfering with each other and not the core functionality/regression itself. And when each test run takes hours to complete, it makes debugging and addressing these tests so much harder.

Also we do know why the grouping fixes the tests. The CI uses a single low powered OpenSearch instance to run all the tests. For OpenSearch Dashboards, we have a lot more tests as compared to plugins, so when we run all the tests sequentially, even though we clean up the data after the tests, OpenSearch does not clear its cache in time and starts to throttle the requests. @SuZhou-Joe came up with a solution to clear the cache for now, but relying on deep dives like that isnt scalable.

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think as a workaround, grouping looks like a step in right direction but one may still face the flakyness in scope of the single group since the cause (why sequential test runs are flaky) is still not understood.

Another concern that I have is that the possibility of detecting a valid regression reduces when running in groups. For e.g., if a feature that causes performance regression is put in one group and by any chance is the last test of the particular group then subsequent groups will not detect that issue. As of now if that happens, the subsequent tests fail with timeout or other related errors.

@rishabh6788 I think this is unnecessary. Each of the fixes so far, including the ones that @SuZhou-Joe 's latest fix for running them sequentially has been to address tests interfering with each other. His fix actually just asks OpenSearch to clear its cache so that the next set of tests can run. This is actually a workaround. UI tests expect the application to be in a specific state to not be flakey and expecting them to work sequentally only makes this more fragile.

Also if these kinds of regressions are really a concern, why arent we running all functional tests (including that of other plugins) sequentially?

If we really want to catch accidental regressions, we should introduce fuzzy tests which by definition will be flakey but can catch errors like the ones that you are concerned about. As for the other integ tests, they are meant to validate a specific function and arent written to valid across tests.

andrross · 2024-06-12T14:57:42Z

Tests interfering with one another (either through concurrent execution or leaving modified state after test runs) is definitely a common problem. We should continue working towards a real solution to this problem so that tests are independent and reliable regardless of the context in which they run. Changing the test run grouping may be a good mitigation for now, but if we don't actually solve the root problem of test interference then we're likely to continue struggling with this going forward.

we should introduce fuzzy tests which by definition will be flakey

Fuzz tests should not be flaky. They should be deterministically reproducible with a given random seed, and any failures they uncover would be real bugs that need to be fixed.

The CI uses a single low powered OpenSearch

CI is using m5.8xlarge instances as far as I know. Is this not correct? This is a 32 core, 128MB memory instance. If using a more powerful machine will help I think we can explore doing that.

reta added enhancement New Enhancement untriaged Issues that have not yet been triaged labels Mar 28, 2024

dblock removed the untriaged Issues that have not yet been triaged label Mar 28, 2024

gaiksaya mentioned this issue Apr 4, 2024

[Retrospective] Release Version 2.13.0 #4602

Closed

gaiksaya added the v2.14.0 label Apr 8, 2024

dblock pinned this issue Apr 15, 2024

dblock assigned ashwin-pc May 13, 2024

dblock added the v2.15.0 label May 13, 2024

peterzhuamazon mentioned this issue Jun 10, 2024

[RELEASE] Release version 2.15.0 #4681

Closed

74 tasks

peterzhuamazon unpinned this issue Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588

[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588

reta commented Mar 28, 2024 •

edited by dblock

Loading

dblock commented Mar 28, 2024

dblock commented Apr 1, 2024

Pallavi-AWS commented Apr 1, 2024

bbarani commented Apr 1, 2024

Pallavi-AWS commented Apr 2, 2024

dblock commented Apr 2, 2024

Pallavi-AWS commented Apr 2, 2024

gaiksaya commented Apr 3, 2024

vikasvb90 commented Apr 9, 2024

gaiksaya commented Apr 9, 2024

vikasvb90 commented Apr 9, 2024 •

edited

Loading

bbarani commented Apr 9, 2024 •

edited

Loading

ashwin-pc commented May 13, 2024

Pallavi-AWS commented May 13, 2024

ashwin-pc commented Jun 8, 2024

ashwin-pc commented Jun 8, 2024

reta commented Jun 10, 2024

rishabh6788 commented Jun 10, 2024

rishabh6788 commented Jun 10, 2024 •

edited

Loading

peterzhuamazon commented Jun 10, 2024 •

edited

Loading

ashwin-pc commented Jun 12, 2024 •

edited

Loading

andrross commented Jun 12, 2024

[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588

[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588

Comments

reta commented Mar 28, 2024 • edited by dblock Loading

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Children

dblock commented Mar 28, 2024

dblock commented Apr 1, 2024

Pallavi-AWS commented Apr 1, 2024

bbarani commented Apr 1, 2024

Pallavi-AWS commented Apr 2, 2024

dblock commented Apr 2, 2024

Pallavi-AWS commented Apr 2, 2024

gaiksaya commented Apr 3, 2024

vikasvb90 commented Apr 9, 2024

gaiksaya commented Apr 9, 2024

vikasvb90 commented Apr 9, 2024 • edited Loading

bbarani commented Apr 9, 2024 • edited Loading

ashwin-pc commented May 13, 2024

Pallavi-AWS commented May 13, 2024

ashwin-pc commented Jun 8, 2024

ashwin-pc commented Jun 8, 2024

reta commented Jun 10, 2024

rishabh6788 commented Jun 10, 2024

rishabh6788 commented Jun 10, 2024 • edited Loading

peterzhuamazon commented Jun 10, 2024 • edited Loading

ashwin-pc commented Jun 12, 2024 • edited Loading

andrross commented Jun 12, 2024

reta commented Mar 28, 2024 •

edited by dblock

Loading

vikasvb90 commented Apr 9, 2024 •

edited

Loading

bbarani commented Apr 9, 2024 •

edited

Loading

rishabh6788 commented Jun 10, 2024 •

edited

Loading

peterzhuamazon commented Jun 10, 2024 •

edited

Loading

ashwin-pc commented Jun 12, 2024 •

edited

Loading