Problem
We run "Breaking Change Detection" gates during software deployment to catch last-minute breaking changes that would otherwise go undetected. However, on occasion we get a false positive, [TMPRL1100] nondeterministic workflow: extra replay command, for a perfectly valid worker / workflow.
To detect breaking changes we
Why
On investigation, the errors occur on workflows that execute a workflow task at around the same time (within milliseconds) as we download the history. As a result, we capture a "partial" or "truncated" history.
Activity execution history
Taking activity execution as an example, we expect the following events:
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED [1]
EVENT_TYPE_WORKFLOW_TASK_COMPLETED
EVENT_TYPE_ACTIVITY_TASK_SCHEDULED [2]
EVENT_TYPE_ACTIVITY_TASK_STARTED
EVENT_TYPE_ACTIVITY_TASK_COMPLETED
(see ref for more information)
If we capture the history at any point between the workflow task start [1] and the activity task schedule [2] (not included), we end up with an extra replay command: replay generates a ScheduleActivityTask command for that workflow task, but the captured history has no corresponding EVENT_TYPE_ACTIVITY_TASK_SCHEDULED event, even though the event is eventually recorded.
In a sense, the Replayer assumes that events [1]-[2] are atomic.
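To make the window concrete, here is a minimal sketch that checks whether a captured history ends in workflow task events with no follow-up event. It is not taken from the attached change: EventType and both helper names are hypothetical stand-ins, and a real implementation would read the event type from the SDK's history events rather than from plain strings.

```go
package main

// EventType is a hypothetical stand-in for the event type enum carried by
// real history events; only the names from the list above are modeled.
type EventType string

const (
	WorkflowTaskScheduled EventType = "EVENT_TYPE_WORKFLOW_TASK_SCHEDULED"
	WorkflowTaskStarted   EventType = "EVENT_TYPE_WORKFLOW_TASK_STARTED"
	WorkflowTaskCompleted EventType = "EVENT_TYPE_WORKFLOW_TASK_COMPLETED"
	ActivityTaskScheduled EventType = "EVENT_TYPE_ACTIVITY_TASK_SCHEDULED"
	ActivityTaskStarted   EventType = "EVENT_TYPE_ACTIVITY_TASK_STARTED"
	ActivityTaskCompleted EventType = "EVENT_TYPE_ACTIVITY_TASK_COMPLETED"
)

// isWorkflowTaskEvent reports whether t is a scheduled/started/completed
// workflow task event.
func isWorkflowTaskEvent(t EventType) bool {
	switch t {
	case WorkflowTaskScheduled, WorkflowTaskStarted, WorkflowTaskCompleted:
		return true
	}
	return false
}

// danglingStart returns the index of the first event in the trailing run of
// workflow task events, or len(events) if the history does not end in one.
func danglingStart(events []EventType) int {
	i := len(events)
	for i > 0 && isWorkflowTaskEvent(events[i-1]) {
		i--
	}
	return i
}

// endsInDanglingWorkflowTask reports whether the history ends in workflow
// task events that have no follow-up event.
func endsInDanglingWorkflowTask(events []EventType) bool {
	return danglingStart(events) < len(events)
}
```

A history for which endsInDanglingWorkflowTask returns true is not necessarily truncated (the workflow may simply be idle), but it is the only shape in which the capture could have landed inside the window. Cutting the slice at danglingStart(events) corresponds to the trimming proposed under Solution.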
Solution
The proposed solution is to trim scheduled/started/completed workflow task events that have no follow-up events, which guarantees the workflow history is in a safely replayable state. For this there are three approaches:
GetWorkflowHistory
(to be discussed with upstream)
Open Question
Code change
The attached code change is a test that demonstrates how a truncated history triggers an extra replay command error.