Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Goal status corruption when goal takes longer to finishes than ActionServer's result_timeout #970

Open
erikschuitema opened this issue Mar 15, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@erikschuitema
Copy link

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • ros-foxy-rcl/focal,now 1.1.13-1focal.20220204.170046 amd64
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

  • Create an ActionServer with a certain result_timeout (default: 900s) and an execute_callback that calls goal_handle.succeed() after a period greater than result_timeout.
  • Create an ActionClient, send a goal and wait for the result, e.g., action_client.send_goal_async(goal).add_done_callback(_on_goal_response) with:
def _on_goal_response(future):
    goal_handle = future.result()
    if goal_handle.accepted:
        goal_handle.get_result_async().add_done_callback(_on_goal_result)


def _on_goal_result(future):
    status = future.result().status
    print(f"Goal result status: {status}")

  • Keep sending a new goal when the previous one finishes.

Expected behavior

All goal result status values are equal to action_msgs.msg.GoalStatus.STATUS_SUCCEEDED.

Actual behavior

Some goal status values are equal to action_msgs.msg.GoalStatus.STATUS_SUCCEEDED and some have an invalid value, e.g., -59 or -91.

Additional information

In rcl_action_expire_goals() (action_server.c), inactive goals are checked against their creation timestamp and considered expired if they are created longer than result_timeout ago. If this function is called directly after the goal succeeds, but before sending the result to the ActionClient, the goal might be considered as expired, its handle deallocated and bogus result data sent back to the ActionClient. Whether the function is called at this point or not seems to depend on the exact course of events and their timing.

I believe the intended behavior is to cache the goal result after the goal finishes (succeed(), abort(), canceled()) to give the client some time to retrieve the result.

Here is some debug log data of one of our production nodes - an ActionServer - to illustrate the course of events. I did not include all lines to avoid clutter, but I'd be happy to send the full log if so desired. Several times, STATUS_SUCCEEDED is returned (4) and sometimes invalid values (-59, -91):

[1647335999.015429264] [action_server_node]: Executing goal with ID [215 184 120 184 118 145  70 116 178  10  54 132 216  44 110 194]
[1647336013.871788332] [action_server_node]: Goal reached!
[1647336013.873899139] [rcl]: Updated timer period from '900000000000ns' to '885129670410ns'
[1647336013.878504394] [action_server_node]: Goal with ID [215 184 120 184 118 145  70 116 178  10  54 132 216  44 110 194] finished with state 4
[1647336432.312907051] [action_server_node]: Executing goal with ID [ 90 235 184 166 112 213  67 229 171  43  38   9  89 154  96  65]
[1647336913.873975199] [rcl]: Timer in wait set is ready
[1647337229.639817858] [action_server_node]: Goal reached!
[1647337229.641596384] [rcl]: Updated timer period from '885129670410ns' to '102661039871ns'
[1647337229.645905871] [action_server_node]: Goal with ID [ 90 235 184 166 112 213  67 229 171  43  38   9  89 154  96  65] finished with state 4
[1647337879.914539520] [action_server_node]: Executing goal with ID [  4 220 198 104 215  48  76  63 155   0 191  39  45  65  21  48]
[1647338114.771408273] [rcl]: Timer in wait set is ready
[1647338781.127302593] [action_server_node]: Goal reached!
[1647338781.128931152] [rcl]: Updated timer period from '102661039871ns' to '0ns'
[1647338781.133443158] [action_server_node]: Goal with ID [  4 220 198 104 215  48  76  63 155   0 191  39  45  65  21  48] finished with state 4
[1647338883.790117503] [rcl]: Timer in wait set is ready
[1647339426.485650098] [action_server_node]: Executing goal with ID [210  78 147 224 212 228  78 138 142  42 122  80 252 208 240 141]
[1647340502.939447051] [action_server_node]: Goal reached!
[1647340502.941234100] [rcl]: Updated timer period from '0ns' to '0ns'
[1647340502.942209879] [rcl]: Timer in wait set is ready
[1647340502.947943806] [action_server_node]: Goal with ID [210  78 147 224 212 228  78 138 142  42 122  80 252 208 240 141] finished with state -91
[1647341152.600432900] [action_server_node]: Executing goal with ID [ 70  12  15 101  50 107  69 114 157 101 137 177 149   8 239  36]
[1647342335.974547175] [action_server_node]: Goal reached!
[1647342335.976116271] [rcl]: Updated timer period from '0ns' to '0ns'
[1647342335.976885865] [rcl]: Timer in wait set is ready
[1647342335.983043025] [action_server_node]: Goal with ID [ 70  12  15 101  50 107  69 114 157 101 137 177 149   8 239  36] finished with state -59
[1647342922.961496135] [action_server_node]: Executing goal with ID [161 156  97 122 246 126  72 167 159 114  98  74 115 120   1 203]
[1647343358.867554586] [rcl]: Updated timer period from '0ns' to '464087590402ns'
[1647343358.868359240] [rcl]: Timer in wait set is ready
[1647343358.869276426] [rcl]: Updated timer period from '464087590402ns' to '464085861582ns'
[1647343358.873525096] [action_server_node]: Goal with ID [161 156  97 122 246 126  72 167 159 114  98  74 115 120   1 203] finished with state 5
[1647343705.921281623] [action_server_node]: Executing goal with ID [ 36  17 229  73  24 216  74 170 148  57 118 169  88 245  57 207]
[1647343822.957017623] [rcl]: Timer in wait set is ready
[1647344134.444931055] [action_server_node]: Goal reached!
[1647344134.446515204] [rcl]: Updated timer period from '464085861582ns' to '471466247058ns'
[1647344134.450559400] [action_server_node]: Goal with ID [ 36  17 229  73  24 216  74 170 148  57 118 169  88 245  57 207] finished with state 4
[1647344598.532509457] [rcl]: Timer in wait set is ready
[1647344598.533403778] [rcl]: Updated timer period from '471466247058ns' to '7379354687ns'
[1647345069.999721681] [rcl]: Timer in wait set is ready
[1647345902.897012440] [action_server_node]: Executing goal with ID [106 242 246 207 113 219  74  13 142  76 224 184 220 234 180 216]
[1647346828.269364764] [action_server_node]: Goal reached!
[1647346828.271057242] [rcl]: Updated timer period from '7379354687ns' to '0ns'
[1647346828.275420165] [action_server_node]: Goal with ID [106 242 246 207 113 219  74  13 142  76 224 184 220 234 180 216] finished with state 4
[1647346835.650637372] [rcl]: Timer in wait set is ready
[1647347243.613745355] [action_server_node]: Executing goal with ID [ 79 212 168 155  71  58  77 135 155  13 216 146 157 219  37 100]
[1647347451.555608172] [action_server_node]: Goal reached!
[1647347451.557238031] [rcl]: Updated timer period from '0ns' to '692046631481ns'
[1647347451.558081235] [rcl]: Timer in wait set is ready
[1647347451.559103617] [rcl]: Updated timer period from '692046631481ns' to '692044761724ns'
[1647347451.563838790] [action_server_node]: Goal with ID [ 79 212 168 155  71  58  77 135 155  13 216 146 157 219  37 100] finished with state 4
[1647348143.605808335] [rcl]: Timer in wait set is ready
[1647348392.123199612] [action_server_node]: Executing goal with ID [166 131  71 148 187 142  66 184 160 139  82 226  88  39  73  94]
[1647349471.575331372] [action_server_node]: Goal reached!
[1647349471.577114966] [rcl]: Updated timer period from '692044761724ns' to '0ns'
[1647349471.581504490] [action_server_node]: Goal with ID [166 131  71 148 187 142  66 184 160 139  82 226  88  39  73  94] finished with state 4
[1647350113.120376174] [action_server_node]: Executing goal with ID [239 228 125  68 244 214  78 184 128 218 235  19 192  41 125  88]
[1647350163.621988713] [rcl]: Timer in wait set is ready
[1647351415.236826018] [action_server_node]: Goal reached!
[1647351415.238423224] [rcl]: Updated timer period from '0ns' to '0ns'
[1647351415.239265006] [rcl]: Timer in wait set is ready
[1647351415.245248589] [action_server_node]: Goal with ID [239 228 125  68 244 214  78 184 128 218 235  19 192  41 125  88] finished with state -91

Whenever I observe Updated timer period from '0ns' to '0ns' after the goal is reached, an invalid state is returned.

@jacobperron jacobperron added the bug Something isn't working label Apr 26, 2022
@jacobperron
Copy link
Member

jacobperron commented Apr 26, 2022

Yes, I think the correct thing to do is to expire goal results based on the time the result is ready (not the goal creation time).

I think one way we could implement this would be to add a member to rcl_action_goal_handle_impl_s to record the result timestamp. We could then note the time when we are notified that a goal has reached a terminal state, e.g. change the signature of this function to include a goal UUID:

rcl_ret_t
rcl_action_notify_goal_done(
const rcl_action_server_t * action_server);

Changing the signature of rcl_action_notify_goal_done is not a backwards compatible fix, though I'm not sure we want to port a fix to an existing release anyways since the change in behavior could be disruptive enough.

@erikschuitema
Copy link
Author

At least the workaround is simple enough: pass a custom result_timeout when creating the ActionServer. Choose it larger than the expected largest time duration to reach a terminal goal state and to retrieve that result, and shorter than the time duration expected to cause memory issues when keeping too many goal results in memory (unlikely).

@erikschuitema
Copy link
Author

@jacobperron Without changing the timing behavior or function signatures, do you think it's possible to create and backport a fix for the memory issue that causes the invalid returned goal state? I was hoping that the goal handle could be properly cleaned up before sending the result value. This would result in a STATUS_UNKNOWN for an 'unknown'/expired goal I guess.
I'll admit I don't know whether this can be solved in rcl, or that rclpy is (also) causing it.

I'm asking because it took me quite some time to debug this issue. It's not trivial for unaware users that observing invalid goal states has something to do with the result_timeout, whether it's server or client related, etc.
It was also hard to reproduce at first, given the default timeout of 900s.

@jacobperron
Copy link
Member

Aside from the timing bug, there is still the race you pointed out. E.g. if we set the expiry time to zero (expire goals immediately), then I guess we'll run into the same invalid memory issue. IMO, this is a bug that should be fixed and backported.

I suspect it is something that needs to be fixed at the rclcpp / rclpy layer, but I don't know for sure without taking some more time to dig into the code.

@jacobperron jacobperron removed their assignment May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants