Fix State machine sub-regions do not resume from last state after restore from persistence #811 #998

ShvetsovDV · 2021-07-31T16:11:52Z

Fixes #811

I think that, to resolve this problem, we need to fix two bugs in AbstractPersistingStateMachineInterceptor:

Determining the real "childRefs" for orthogonal states
Here we need to look at line 171. I think the condition "stateMachine.getState().isOrthogonal()" is unnecessary.
Determining the real state machine identifier for the context
Currently all changes are saved for the main state machine (under the main state machine id). We need to fix that.

The second part of the fix is to edit resetStateMachine in AbstractStateMachine.

chutch · 2021-09-13T08:17:45Z

Hey @ShvetsovDV , thanks for looking at it.

I think there are actually 3 different bugs contributing to this issue, and all of them need to be addressed. I have been looking into this issue on my project lately and would like to help push those fixes through. Maybe we can do it one by one?

Resuming

DefaultStateMachineService resets all regions instead of only the top one

You already fixed this here. Going through all regions is wrong, because the reset is recursive anyway.

Regions are overwritten upon reset

Fixed here. Without the fix, every region we iterate over is always overwritten by the last context.
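
To make the two resuming points concrete, here is a minimal sketch in plain Java. It is a simplified model, not the library's code: `Machine`, `Context`, the ids `root`, `root#r1`, `root#r2` and the states `S1`..`S3` are all made up for illustration. It assumes that reset is called once on the top machine and recurses into regions, and that each region picks the child context with a matching id rather than whichever context happens to come last.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model; Machine and Context are hypothetical stand-ins, not library types.
final class RegionResetSketch {

    static final class Context {
        final String id;
        final String state;
        final List<Context> children = new ArrayList<>();

        Context(String id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    static final class Machine {
        final String id;
        String state;
        final List<Machine> regions = new ArrayList<>();

        Machine(String id) {
            this.id = id;
        }
    }

    // Called once on the top machine; the recursion reaches every region.
    static void reset(Machine machine, Context context) {
        machine.state = context.state;
        for (Machine region : machine.regions) {
            for (Context child : context.children) {
                // Match by id so a region never receives a sibling's context.
                if (child.id.equals(region.id)) {
                    reset(region, child);
                }
            }
        }
    }

    public static void main(String[] args) {
        Machine root = new Machine("root");
        Machine r1 = new Machine("root#r1");
        Machine r2 = new Machine("root#r2");
        root.regions.add(r1);
        root.regions.add(r2);

        Context saved = new Context("root", "S1");
        saved.children.add(new Context("root#r1", "S2"));
        saved.children.add(new Context("root#r2", "S3"));

        reset(root, saved);
        System.out.println(root.state + " " + r1.state + " " + r2.state);
    }
}
```

In this model, calling `reset` again on each region from the outside would be redundant, and dropping the id check would leave every region holding the state of the last child context.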

Persisting

There is also an issue with persisting the context of regions: the context of a region overwrites the main context.

Now, I'm not sure whether the proposed solution is the correct one. Let me explain what I have observed. I have implemented my own StateMachineRuntimePersister, which also implements the StateMachineInterceptor interface. Two methods are of interest there:

void preStateChange(State<S, E> state, Message<E> message, Transition<S, E> transition,
        StateMachine<S, E> stateMachine, StateMachine<S, E> rootStateMachine);

void postStateChange(State<S, E> state, Message<E> message, Transition<S, E> transition,
        StateMachine<S, E> stateMachine, StateMachine<S, E> rootStateMachine);

The OOTB AbstractPersistingStateMachineInterceptor implementation persists mostly in the preStateChange method. What I observed is that in postStateChange, the stateMachine and rootStateMachine passed for regions are always correct and are different objects.

In preStateChange, however, stateMachine and rootStateMachine are always (or almost always, I don't recall) the same object, which imho means there is a bug somewhere upstream.

The "Persisting" issues can be worked around quickly by using a custom persister and persisting in postStateChange only. The "Resuming" issues, however, need to be fixed in the library/project itself.

So, I have two questions... do you think we can split the fixes for persisting & resuming? How can I help to push these things through? I can gladly take over.

ShvetsovDV · 2021-09-14T07:34:38Z

@chutch, thanks for the answer!

Indeed, when solving the problem, several errors were fixed in different parts of the code. I think we need to commit the edits with one commit, otherwise there will be no positive effect.

I would also like to receive a comment on my commits from @jvalkeal .

> However, for preStateChange the stateMachine and rootStateMachine are always (or almost always, I don't recall) the same object. Meaning there is a bug imho somewhere upstream.

Here I fixed a bug in determining the machine ID for the context before saving it. Perhaps what you observed is due to the fact that buildStateMachineContext is called from places other than preStateChange and postStateChange.

I would be glad to receive examples of cases where my solution does not work.

chutch

OK, I'll try to rephrase to explain my concern. Let's assume that we have the following simple state machine:

            ┌─────►X1─────►X2────┐
            │                    │
start ───►A─┤                   ┌┘►B────►Z
            │                   │
            └─────►Y1─────►Y2───┘

Where:

start, A, B, Z are states of the root state machine, with id example
X1, X2 are states of a sub-machine, a region called x, therefore with id example#x
Y1, Y2 are states of another sub-machine, a region called y

Now, let's assume a state change from X1 -> X2. From the debugger I see that when preStateChange & postStateChange are called for the same state change, they are called with different parameters:

preStateChange

public void preStateChange(State<S, E> state, Message<E> message, Transition<S, E> transition, StateMachine<S, E> stateMachine, StateMachine<S, E> rootStateMachine) {

// with "values"
state=X2
stateMachine=example
rootStateMachine=example

// the very same SM object, root one
stateMachine == rootStateMachine

postStateChange

public void postStateChange(State<S, E> state, Message<E> message, Transition<S, E> transition, StateMachine<S, E> stateMachine, StateMachine<S, E> rootStateMachine) {

// with "values"
state=X2
stateMachine=example#x  <== correct SM, because the state change happened within a region
rootStateMachine=example

// different SM objects
stateMachine != rootStateMachine

That makes me think that trying to fix this issue by "reverse-engineering" the id with findSmIdByRegion may not be the proper way to fix it.

For some reason, for the very same state change, postStateChange receives the proper SM objects. Passing those 2 objects to buildStateMachineContext() then works perfectly fine without any changes in the code. So persisting in postStateChange seems to work fine.

preStateChange, on the other hand, receives the root SM twice: the very same object, the very same reference.
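
The effect of this difference on persistence can be illustrated with a small model in plain Java. This is only a sketch of the observation, not the library's API: `ContextStore` and both `persist...` methods are hypothetical, and the only things taken from the example are the ids `example` and `example#x` and the states `A` and `X2`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the observation, not the Spring Statemachine API.
final class RegionPersistSketch {

    // Hypothetical stand-in for the persisted context table: machine id -> state.
    static final class ContextStore {
        private final Map<String, String> stateById = new LinkedHashMap<>();

        void save(String machineId, String state) {
            stateById.put(machineId, state);
        }

        String load(String machineId) {
            return stateById.get(machineId);
        }
    }

    // Models the preStateChange call as observed: the region's change arrives
    // with stateMachine == rootStateMachine, so it is stored under the root id.
    static void persistWithRootOnly(ContextStore store, String rootId, String state) {
        store.save(rootId, state);
    }

    // Models the postStateChange call as observed: the region's own machine is
    // passed in, so the change is stored under the region id.
    static void persistWithRegion(ContextStore store, String regionId, String state) {
        store.save(regionId, state);
    }

    public static void main(String[] args) {
        ContextStore pre = new ContextStore();
        pre.save("example", "A");
        persistWithRootOnly(pre, "example", "X2");
        // The root entry is overwritten and the region has no entry of its own.
        System.out.println(pre.load("example") + " " + pre.load("example#x")); // X2 null

        ContextStore post = new ContextStore();
        post.save("example", "A");
        persistWithRegion(post, "example#x", "X2");
        // The root entry is intact and the region is stored under its own id.
        System.out.println(post.load("example") + " " + post.load("example#x")); // A X2
    }
}
```

In this model the first store ends up with the region's state under the root id, which matches the "context of a region overwrites the main context" symptom.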

So this makes me think that maybe the fix should be made somewhere deeper. Please mind this is just a hunch and I may be wrong. I think I would need a lot more time to understand the inner workings of SM to see where the bug is. I am sure your code is working and doing the job. I am just wondering whether it is going to mask a deeper issue.

Let me know what you think about it. Hope this makes sense.

chutch · 2021-09-14T15:24:12Z

...java/org/springframework/statemachine/persist/AbstractPersistingStateMachineInterceptor.java

@@ -168,14 +159,14 @@ public void setExtendedStateVariablesFunction(
 		if (state.isSubmachineState()) {
 			id = getDeepState(state);
 		} else if (state.isOrthogonal()) {
-			if (stateMachine.getState().isOrthogonal()) {
+			//if (stateMachine.getState().isOrthogonal()) {


I don't understand this change, can you explain?

In this part of the code, it is important for us to write the child state machine identifiers into the parent context.

When we have a state machine with nested state machines, we need to get a context like this:

DefaultStateMachineContext [ id=testid ,
  childs= [
    DefaultStateMachineContext [ id=testid#FIRST , childs=[] , childRefs=[] , state=S21 , historyStates={} , event=E3 , eventHeaders={id=9ab47504-7da3-1853-8a2d-4a7e667a2def, timestamp=1633085509033} , extendedState=DefaultExtendedState [variables={}]] ,
    DefaultStateMachineContext [ id=testid#SECOND , childs=[] , childRefs=[] , state=S31 , historyStates={} , event=E1 , eventHeaders={id=9638028e-a69e-4adf-92d4-bb4655f4c2a2, timestamp=1633085509039} , extendedState=DefaultExtendedState [variables={}]]] ,
  childRefs=[] , state=S2 , historyStates={} , event=null , eventHeaders=null , extendedState=DefaultExtendedState [variables={}]]

This context for the parent state machine must include the child identifiers (in our example, id=testid#FIRST and id=testid#SECOND). Using these identifiers, we restore the context of the child state machines from the database (from the "state" table, by the "machine_id" field).

In the line of code } else if (state.isOrthogonal()) { we determine whether our target state is orthogonal.

In the line of code if (stateMachine.getState().isOrthogonal()) { we determine whether our source state is orthogonal.

Requiring the source state and the target state to both be orthogonal at the same time looks illogical, and such a case would look like a configuration error in the state machine.

Because of the line if (stateMachine.getState().isOrthogonal()) { the child state machine IDs are not saved in the parent context.

For example, consider the testPersistRegionsAndRestore test.
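
The argument about the nested condition can be sketched as a small model in plain Java. This is an illustration of the reasoning only, not the interceptor's actual code: `childRefs(...)` and its boolean parameters are hypothetical, and the sketch assumes the outer check looks at the target state and the inner check looks at the machine's current (source) state, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the nested isOrthogonal() checks; all names are hypothetical.
final class ChildRefsSketch {

    static List<String> childRefs(boolean targetOrthogonal, boolean currentOrthogonal,
            List<String> regionIds, boolean keepInnerCheck) {
        List<String> refs = new ArrayList<>();
        if (targetOrthogonal) {
            // With the inner check kept, refs are recorded only if the current
            // (source) state is orthogonal as well.
            if (!keepInnerCheck || currentOrthogonal) {
                refs.addAll(regionIds);
            }
        }
        return refs;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("testid#FIRST", "testid#SECOND");
        // Entering an orthogonal state from a non-orthogonal one.
        System.out.println(childRefs(true, false, ids, true));  // []
        System.out.println(childRefs(true, false, ids, false)); // [testid#FIRST, testid#SECOND]
    }
}
```

Under these assumptions, keeping the inner check drops the region ids exactly in the case where the machine enters an orthogonal state from a plain one.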

ShvetsovDV · 2021-10-01T12:40:10Z

You are right that the error lies somewhere deeper. But the source code is difficult to work with and needs a serious rewrite. Such rework takes time. Carrying out a serious rewrite without coordinating with the owner of the source code, and without their explanation of certain decisions, risks new errors and lost time.

In my implementation, I parse the structure of the state machine locally in the findSmIdByRegion method. This fix will not introduce new bugs, although it looks like quite a hack.

ShvetsovDV · 2021-10-01T12:47:52Z

In a serious revision, to simplify the code, I would like to move the methods for analyzing the structure of the state machine, similar to findSmIdByRegion (perhaps findNextState, getAvailableStates, findEventsFromSourceState, ...), into a separate static class.

kiran-mallineni · 2024-07-25T20:57:29Z

Yes, there is a bug in AbstractStateMachine, which calls AbstractPersistingStateMachineInterceptor with an incorrect state machine. Can we prioritise fixing this in the library? We fixed some parts by extending the library; however, AbstractStateMachine is very complicated, and there are no proper extension points.

ShvetsovDV added 3 commits July 22, 2021 17:09

fix issue 811

4837d83

fix issue 811 part two (correct apply restored state context)

e813399

fix issue 811 add test

2002647

ShvetsovDV changed the title ~~Sec 811~~ Fix Sec 811 State machine sub-regions do not resume from last state after restore from persistence #811 Aug 1, 2021

ShvetsovDV changed the title ~~Fix Sec 811 State machine sub-regions do not resume from last state after restore from persistence #811~~ Fix State machine sub-regions do not resume from last state after restore from persistence #811 Aug 1, 2021

chutch reviewed Sep 14, 2021

chutch mentioned this pull request Sep 15, 2021

DefaultStateMachineService does not restore state properly #959

Open

chrisgibson41 approved these changes Sep 19, 2024
