replay partial history test #1

Open
wants to merge 1 commit into
base: master

@RamyElkest RamyElkest commented Oct 8, 2024

Problem

We run "Breaking Change Detection" gates during software deployment to catch breaking changes that would otherwise go undetected until the last minute. Occasionally, however, a gate reports a false positive, [TMPRL1100] nondeterministic workflow: extra replay command, for a perfectly valid worker and workflow.

To detect breaking changes, we:

  1. Download the workflow history using GetWorkflowHistory from the target environment.
  2. Pass it to ReplayWorkflowHistory which returns an error if the replay fails.

Why

On investigation, the errors occur on workflows that execute a workflow task at around the same time (within milliseconds) as we download the history. As a result, we capture a "partial" or "truncated" history.

Activity execution history

Taking activity execution as an example, we expect the following events:

  • EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
  • EVENT_TYPE_WORKFLOW_TASK_STARTED [1]
  • EVENT_TYPE_WORKFLOW_TASK_COMPLETED
  • EVENT_TYPE_ACTIVITY_TASK_SCHEDULED [2]
  • EVENT_TYPE_ACTIVITY_TASK_STARTED
  • EVENT_TYPE_ACTIVITY_TASK_COMPLETED
    (see ref for more information)

If we capture the history at any point between the workflow task start [1] and the activity task schedule [2] (not included), replay fails with an extra replay command error. Replay generates a ScheduleActivityTask command for that workflow task, but the captured history has no corresponding EVENT_TYPE_ACTIVITY_TASK_SCHEDULED event, even though the server records that event moments later.

In a sense, the Replayer assumes that events [1] through [2] are recorded atomically.
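
To make the window concrete, here is a minimal sketch that walks every truncation point of the six-event example above and flags the ones that fall between [1] and [2]. It models events as plain strings rather than Temporal SDK types, and endsInUnsafeWindow and unsafePrefixLengths are hypothetical helpers written for this illustration. It encodes only the description above, not the Replayer's actual logic.

```go
package main

import "fmt"

// activityHistory lists the six events from the example above, in order.
var activityHistory = []string{
	"EVENT_TYPE_WORKFLOW_TASK_SCHEDULED",
	"EVENT_TYPE_WORKFLOW_TASK_STARTED",   // [1]
	"EVENT_TYPE_WORKFLOW_TASK_COMPLETED",
	"EVENT_TYPE_ACTIVITY_TASK_SCHEDULED", // [2]
	"EVENT_TYPE_ACTIVITY_TASK_STARTED",
	"EVENT_TYPE_ACTIVITY_TASK_COMPLETED",
}

// endsInUnsafeWindow (hypothetical) reports whether a captured history stops
// at or after the workflow task start [1] but before the activity task
// schedule [2], i.e. its last event is the workflow task start or completion.
func endsInUnsafeWindow(events []string) bool {
	if len(events) == 0 {
		return false
	}
	switch events[len(events)-1] {
	case "EVENT_TYPE_WORKFLOW_TASK_STARTED",
		"EVENT_TYPE_WORKFLOW_TASK_COMPLETED":
		return true
	}
	return false
}

// unsafePrefixLengths (hypothetical) returns the length of every prefix of
// events that ends inside the unsafe window.
func unsafePrefixLengths(events []string) []int {
	var lengths []int
	for n := 1; n <= len(events); n++ {
		if endsInUnsafeWindow(events[:n]) {
			lengths = append(lengths, n)
		}
	}
	return lengths
}

func main() {
	// Print each truncation point and whether it lands in the window.
	for n := 1; n <= len(activityHistory); n++ {
		prefix := activityHistory[:n]
		fmt.Printf("%d events, last=%s, unsafe=%v\n",
			n, prefix[n-1], endsInUnsafeWindow(prefix))
	}
	fmt.Println("unsafe prefix lengths:", unsafePrefixLengths(activityHistory))
}
```

Under this model, only the captures that end on the workflow task start or completion are flagged; a capture that already includes EVENT_TYPE_ACTIVITY_TASK_SCHEDULED is not.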

Solution

The proposed solution is to trim trailing scheduled/started/completed workflow task events that have no follow-up events, so that the workflow history is in a safely replayable state. There are three places where this could be done:

  1. Trim the history in GetWorkflowHistory (to be discussed with upstream)
  2. Trim the history in our code before passing it to the Replayer
  3. Trim the history in the Replayer (to be discussed with upstream)
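
As a rough sketch of approach 2, the trimming step could look like the following. It operates on event type names as plain strings instead of the SDK's history types, and trimTrailingWorkflowTask and isWorkflowTaskEvent are hypothetical names introduced here, so treat it as an outline of the idea under those assumptions rather than the final change.

```go
package main

import "fmt"

// isWorkflowTaskEvent reports whether an event type is one of the three
// workflow task events named in the proposal.
func isWorkflowTaskEvent(eventType string) bool {
	switch eventType {
	case "EVENT_TYPE_WORKFLOW_TASK_SCHEDULED",
		"EVENT_TYPE_WORKFLOW_TASK_STARTED",
		"EVENT_TYPE_WORKFLOW_TASK_COMPLETED":
		return true
	}
	return false
}

// trimTrailingWorkflowTask (hypothetical) drops the trailing run of workflow
// task events that have no follow-up events. Workflow task events that are
// followed by any other event are left untouched.
func trimTrailingWorkflowTask(events []string) []string {
	end := len(events)
	for end > 0 && isWorkflowTaskEvent(events[end-1]) {
		end--
	}
	return events[:end]
}

func main() {
	// A capture taken mid-workflow-task: the last two events have no
	// follow-up, so they are trimmed before replay.
	captured := []string{
		"EVENT_TYPE_WORKFLOW_EXECUTION_STARTED",
		"EVENT_TYPE_WORKFLOW_TASK_SCHEDULED",
		"EVENT_TYPE_WORKFLOW_TASK_STARTED",
		"EVENT_TYPE_WORKFLOW_TASK_COMPLETED",
		"EVENT_TYPE_ACTIVITY_TASK_SCHEDULED",
		"EVENT_TYPE_ACTIVITY_TASK_STARTED",
		"EVENT_TYPE_ACTIVITY_TASK_COMPLETED",
		"EVENT_TYPE_WORKFLOW_TASK_SCHEDULED",
		"EVENT_TYPE_WORKFLOW_TASK_STARTED",
	}
	trimmed := trimTrailingWorkflowTask(captured)
	fmt.Println("captured:", len(captured), "trimmed:", len(trimmed))
	fmt.Println("last event after trim:", trimmed[len(trimmed)-1])
}
```

The trim only ever removes events from the tail, so a history that already ends on a non-workflow-task event passes through unchanged. Other workflow task outcomes, such as timeouts or failures, are not covered by the proposal and are not handled here.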

Open Question

  • How does the worker deal with this on replay? Is it guaranteed to retrieve a consistent event history?

Code change

The attached code change is a test that demonstrates how a truncated history triggers an extra replay command error.
