Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Reprovision Workflow IT is flaky #870

Closed
dbwiddis opened this issue Sep 12, 2024 · 5 comments · Fixed by #918
Closed

[BUG] Reprovision Workflow IT is flaky #870

dbwiddis opened this issue Sep 12, 2024 · 5 comments · Fixed by #918
Assignees
Labels
bug Something isn't working v2.18.0 Issues targeting release v2.18.0

Comments

@dbwiddis
Copy link
Member

dbwiddis commented Sep 12, 2024

What is the bug?

FlowFrameworkRestApiIT.testReprovisionWorkflow() is flaky. During 2.17.0 release it required multiple retries to pass on Windows x86. There are occasional GitHub CI failures on other platforms, see https://github.com/opensearch-project/flow-framework/actions/runs/10777769278/job/29887668921#step:4:59 (Ubuntu) and https://github.com/opensearch-project/flow-framework/actions/runs/10777865339/job/29888745661#step:4:49 (macOS)

FlowFrameworkRestApiIT > testReprovisionWorkflow FAILED
    org.junit.ComparisonFailure: expected:<[COMPLET]ED> but was:<[FAIL]ED>
        at __randomizedtesting.SeedInfo.seed([4D7D6[49](https://github.com/opensearch-project/flow-framework/actions/runs/10777865339/job/29888745661#step:4:50)945752BD3:196DC17D8E477B7D]:0)
        at org.junit.Assert.assertEquals(Assert.java:117)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.getAndAssertWorkflowStatus(FlowFrameworkRestTestCase.java:7[53](https://github.com/opensearch-project/flow-framework/actions/runs/10777865339/job/29888745661#step:4:54))
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.lambda$getResourcesCreated$7(FlowFrameworkRestTestCase.java:805)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1136)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.getResourcesCreated(FlowFrameworkRestTestCase.java:804)
        at org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testReprovisionWorkflow(FlowFrameworkRestApiIT.java:471)

How can one reproduce the bug?

Run our CI repeatedly. It fails reasonably often.

What is the expected behavior?

Tests consistently pass.

What is your host/environment?

Reproducible on Windows, Linux, macOS as shown above

Do you have any additional context?

The failure is occurring consistently on the attempt to reprovision to remove default ingest pipeline (setting it to _none, which is the SearchPipelineService.NOOP_PIPELINE_ID).

The change in default pipeline name is the only difference in the two templates.

@dbwiddis
Copy link
Member Author

dbwiddis commented Sep 13, 2024

Observations:

  1. The logs are confusing because the execution sequence uses the original step id's, even though there are "update" steps or "workflow data" steps under the hood. Logging should be improved to include not only the id ("create_index") in this case, but the type (which is actually supposed to be "update_index").
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [create_openai_connector]. Can start immediately!
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [register_openai_model]. Must wait for [create_openai_connector] to complete first.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [create_ingest_pipeline]. Must wait for [register_openai_model] to complete first.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Updated template for hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [create_index]. Must wait for [create_ingest_pipeline] to complete first.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Starting create_openai_connector.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Finished create_openai_connector.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Starting register_openai_model.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Finished register_openai_model.
» [2024-09-09T17:07:20,501][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Starting create_ingest_pipeline.
» [2024-09-09T17:07:20,501][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Finished create_ingest_pipeline.
» [2024-09-09T17:07:20,501][INFO ][o.o.f.w.ProcessNode      ] [integTest-0] Starting create_index.
» [2024-09-09T17:07:20,503][INFO ][o.o.f.t.GetWorkflowStateTransportAction] [integTest-0] Querying state workflow doc: hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,535][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] updated workflow hE7B15EBOAAJWyIRCqLQ state to FAILED
  1. Looking at WorkflowProcessSorter.createProcessNode(), the expectation is the ProcessNode properly looks up the Update step in Case 2, which calls createUpdateProcessNode().
        String updateStepName = WorkflowResources.getUpdateStepByWorkflowStep(node.type());
        if (updateStepName != null) {
            WorkflowStep step = workflowStepFactory.createStep(updateStepName);
            return new ProcessNode(
                node.id(),
                step,

But the stack trace indicates it's executing CreateIndexStep:

» ERROR][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Reprovisioning failed for workflow hE7B15EBOAAJWyIRCqLQ during step create_index.
»  org.opensearch.flowframework.exception.WorkflowStepException: Failed to create the index my-nlp-index
»  	at org.opensearch.flowframework.workflow.CreateIndexStep.lambda$execute$1(CreateIndexStep.java:120) ~[?:?]
  1. That indicates we're matching case 1:
        if (!originalTemplateMap.containsKey(node.id())) {
            // Case 1: Additive modification, create new node
            return createNewProcessNode(node, data, predecessorNodes, nodeTimeout);
  1. Going back to the test case, there are actually two reprovisionings that occur; the original template provisioned only has the first 3 steps. The "create index" is added with the first reprovision as a 4th step to trigger Case 1 (which completes successfully). But it appears the second reprovision may be comparing vs. the original template.
  2. The original template is populated by a GET request early in the reprovisioning process. It appears that in this case it may not have been updated.
  3. Looking at ReprovisionWorkflowTransportAction.executeWorkflow() we see that the state index is updated first, before the template is replaced in the global context:
            flowFrameworkIndicesHandler.updateFlowFrameworkSystemIndexDoc(
                workflowId,
                Map.ofEntries(
                    Map.entry(STATE_FIELD, State.COMPLETED),
                    Map.entry(PROVISIONING_PROGRESS_FIELD, ProvisioningProgress.DONE),
                    Map.entry(PROVISION_END_TIME_FIELD, Instant.now().toEpochMilli())
                ),
                ActionListener.wrap(updateResponse -> {

                    logger.info("updated workflow {} state to {}", workflowId, State.COMPLETED);

                    // Replace template document
                    flowFrameworkIndicesHandler.updateTemplateInGlobalContext(
                        workflowId,
                        template,
                        ActionListener.wrap(templateResponse -> {
                            logger.info("Updated template for {}", workflowId, State.COMPLETED);
                        }, exception -> {
                            String errorMessage = "Failed to update use case template for " + workflowId;
                            logger.error(errorMessage, exception);
                        }),
                        true  // ignores NOT_STARTED state if request is to reprovision
                    );
                }, exception -> { logger.error("Failed to update workflow state for workflow {}", workflowId, exception); })
            );

This reversed sequence is actually evident in the logs, a portion actually included in observation 1 above. I saw it and thought it was unusual, and now it makes complete sense.

  • The first workflow completed at 07:20,325.
  • The test moved on to the next reprovision, querying the (old) template at 07:20,391.
  • The first reprovision belatedly updated the template at 07:20,500, 109 milliseconds too late.
» [2024-09-09T17:07:20,325][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Reprovisioning completed successfully for workflow hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,391][INFO ][o.o.f.t.CreateWorkflowTransportAction] [integTest-0] Querying existing workflow from global context: hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,394][INFO ][o.o.f.t.CreateWorkflowTransportAction] [integTest-0] Reprovisioning parameter is set, continuing to reprovision workflow hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,394][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Querying state for workflow: hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,394][INFO ][o.o.f.t.GetWorkflowStateTransportAction] [integTest-0] Querying state workflow doc: hE7B15EBOAAJWyIRCqLQ
» [2024-09-09T17:07:20,448][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] updated workflow hE7B15EBOAAJWyIRCqLQ state to COMPLETED
» [2024-09-09T17:07:20,499][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Updated workflow hE7B15EBOAAJWyIRCqLQ state to PROVISIONING
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [create_openai_connector]. Can start immediately!
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [register_openai_model]. Must wait for [create_openai_connector] to complete first.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Queueing process [create_ingest_pipeline]. Must wait for [register_openai_model] to complete first.
» [2024-09-09T17:07:20,500][INFO ][o.o.f.t.ReprovisionWorkflowTransportAction] [integTest-0] Updated template for hE7B15EBOAAJWyIRCqLQ

Conclusion: We have a race condition in the test: the first reprovision (which adds a create_index step) completes successfully and updates the state to completed; which in the test case moves on to reprovision a second time. It fails in the case where the GET occurs before the update, fetching the earlier copy of the template.

Needed fix: Make sure the template is updated before reporting the state as COMPLETED. Given that we typically update the workflow before provisioning, it would make sense to do it before executing the reprovisioning steps, but after validation; basically at a point where we'll either complete or fail with an error. Alternately, try to do both updates simultaneously using a latch (see ML Commons delete model for an example).

Also improve the logging to more clearly indicate the type of reprovision step being used. :)

@dbwiddis
Copy link
Member Author

@gaiksaya do you want us to try to merge a fix for this into 2.17.1?

@gaiksaya
Copy link
Member

@gaiksaya do you want us to try to merge a fix for this into 2.17.1?

If it is just a test fix then sure @dbwiddis.
cc: @getsaurabh02 @Divyaasm

@dbwiddis
Copy link
Member Author

@dbwiddis
Copy link
Member Author

For clarity, PR #882 fixed the integration test, but there are still some underlying issues that need addressing in another PR:

  • Logging is unclear as it is using the old step id (e.g., create_index) rather than the name of the step actually being used (e.g., update_index).
  • The template update occurs after the completion of all the reprovisioning (unless it fails in which case the original template remains). This is inconsistent with normal provisioning where the template is updated before any provisioning action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v2.18.0 Issues targeting release v2.18.0
Projects
None yet
4 participants