Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: Reconsider the semantics of cancel.Wrap #208

Closed
iwahbe opened this issue Mar 29, 2024 · 8 comments
Closed

Design: Reconsider the semantics of cancel.Wrap #208

iwahbe opened this issue Mar 29, 2024 · 8 comments
Assignees
Labels
1.0.0-blocker An issue that must be resolved before 1.0.0 kind/design An engineering design doc, usually part of an Epic resolution/by-design This issue won't be fixed because the functionality is working as designed

Comments

@iwahbe
Copy link
Member

iwahbe commented Mar 29, 2024

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Following up on the discussion from #207, it seems like the as-designed behavior of the cancel middleware is unintuitive/unappealing to some.

The current behavior as described by cancel's doc comment (PR pending) is:

// The `cancel` package provides a middle-ware that ties the Cancel gRPC call from Pulumi
// to Go's `context.Context` cancellation system.
//
// Wrapping a provider in `cancel.Wrap` ensures 2 things:
//
// 1. When a resource operation times out, the associated context is canceled.
//
// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.
//
// A `cancel.Wrap`ed provider will still call the `Cancel` method on the underlying
// provider. If NotImplemented is returned, it will be swallowed.

This issue is a great place to comment on why the existing design should change.

Affected area/feature

@iwahbe iwahbe added kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels Mar 29, 2024
@iwahbe
Copy link
Member Author

iwahbe commented Apr 1, 2024

For context, the bridge completely ignores the Cancel method:

SDK:

// Cancel requests that the provider cancel all ongoing RPCs. For TF, this is a no-op.
func (p *Provider) Cancel(ctx context.Context, req *pbempty.Empty) (*pbempty.Empty, error) {
	return &pbempty.Empty{}, nil
}

PF:

// SignalCancellation asks all resource providers to gracefully shut down and abort any ongoing operations. Operation
// aborted in this way will return an error (e.g., `Update` and `Create` will either a creation error or an
// initialization error. SignalCancellation is advisory and non-blocking; it is up to the host to decide how long to
// wait after SignalCancellation is called before (e.g.) hard-closing any gRPC connection.
func (p *provider) SignalCancellationWithContext(_ context.Context) error {
	// Some improvements are possible here to gracefully shut down.
	return nil
}

@t0yv0
Copy link
Member

t0yv0 commented Apr 1, 2024

The only point of implementing cancel would be to make some operation atomic with respect to cancellation to reduce the likelihood of exposing "partial states" exposed by interrupting such an operation in flight. Can we think of any? The stateful operation I'm thinking about is creating or updating a resource. It can be non-atomic if for example the provider creates a bucket first and then applies tags. Does this apply to the command provider, does it benefit from graceful cancel of a command? I'm apriori doubtful if any of the providers can meaningfully "gracefully cancel" these and would appreciate concrete evidence that it's helpful before throwing effort at implementing Cancel better at the generic level.

@mjeffryes mjeffryes removed the needs-triage Needs attention from the triage team label Apr 1, 2024
@iwahbe
Copy link
Member Author

iwahbe commented Apr 3, 2024

The only point of implementing cancel would be to make some operation atomic with respect to cancellation to reduce the likelihood of exposing "partial states" exposed by interrupting such an operation in flight.

I don't agree here. Implementing Cancel at a resource level can be used in at least 2 ways. To either:

  • Make a resource operation atomic by reverting a partial change.
  • Record a partial change into the state so a subsequent pulumi up knows about the partial state.

I can't think of how either of these would apply to pulumi-command, but I do think they would apply to a traditional cloud provider. They do apply in the bridge.

does it benefit from graceful cancel of a command? I'm apriori doubtful if any of the providers can meaningfully "gracefully cancel" these and would appreciate concrete evidence that it's helpful before throwing effort at implementing Cancel better at the generic level.

Yes. The best way to illustrate this is to compare the behavior of pulumi-command and an alternate build with the cancel middleware removed:

The command provider benefits from handing Cancel at the generic level by shutting down quickly with correct error messages. When Cancel is sent the the command provider while executing a lengthy command (via C-c), the response is instant:

Do you want to perform this update? yes
Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/pulumi/dev-yaml/dev/updates/134

Loading policy packs...

     Type                      Name          Status                  Info
     pulumi:pulumi:Stack       dev-yaml-dev  **failed**              1 error
 +   └─ command:local:Command  sleep         **creating failed**     1 error

Diagnostics:
  command:local:Command (sleep):
    error: signal: killed: running "sleep 10":

  pulumi:pulumi:Stack (dev-yaml-dev):
    error: update failed

Resources:
    1 unchanged

Duration: 7s

The provider shuts down cleanly and quickly. Pulumi feels responsive.

When running against a build of pulumi-command that excludes the cancel middleware, the provider ignores the first C-c sent to it. This feels unresponsive. On the second C-c, the provider is killed:

Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/pulumi/dev-yaml/dev/updates/137

Loading policy packs...

     Type                      Name          Status      
     pulumi:pulumi:Stack       dev-yaml-dev              
 +   └─ command:local:Command  sleep         created     

Resources:
    1 unchanged

Duration: 8s

error: update cancelled

Pulumi feels less responsive because the first C-c was ignored. Because the resource can't communicate back any error, the user can't see that Cancel was called while sleep 10 was running. This feedback might be significant if the command being run was aws s3 cp src dst, where partial states are possible.

In addition, the pulumi up display is wrong: sleep wasn't created and pulumi up will correctly try to create it again when pulumi up is run again.


I think the command provider experience is a pretty strong argument that providers need to exit when Cancel is called, and the best way we can get that "for free" IMO is with the cancel middleware's current semantics.

If a provider wants to customize the way it handles Cancel, it can do so at a resource by resource level simply by handling the cancelation or timeout of its associated context.Context. I expect most providers will not handle Cancel at the resource level, and canceling a request will simply return with an exit, a significant improvement in responsiveness over default behavior.

@t0yv0
Copy link
Member

t0yv0 commented Apr 3, 2024

Pulumi experience is specifically designed so the first Ctrl+C is not killing instantly so that providers can gracefully terminate. The second Ctrl+C does kill immediately. I think I can find refs for this in the p/p interrupt handler somewhere.

So for a provider, instead of ignoring Ctrl-C you can:

  • a) kill the process in response to Cancel.
  • b) if Cancel can block, then block until there're no gRPC requests outstanding and return; these protects gRPC requests as atomic; I'd expect this is available on gRPC server level

Either strategy feels like something with very few LOC and no concurrent code to worry about. To justify something sophisticated I'd love to see some worked out examples. I don't know of the bridge capability to gracefully cancel partial creates. I'm not sure what you mean by partial state in the engine, you mean "create failed" state? Any references there?

@iwahbe
Copy link
Member Author

iwahbe commented Apr 5, 2024

Pulumi experience is specifically designed so the first Ctrl+C is not killing instantly so that providers can gracefully terminate. The second Ctrl+C does kill immediately. I think I can find refs for this in the p/p interrupt handler somewhere.

That is my understanding. That means that the provider gets exactly one Cancel to respond to, since it can't respond to the second Ctrl+C.

  • a) kill the process in response to Cancel.
  • b) if Cancel can block, then block until there're no gRPC requests outstanding and return; these protects gRPC requests as atomic; I'd expect this is available on gRPC server level

Killing the process yourself doesn't help much, and Cancel can't block. Instead providers need to shut down gracefully as fast as possible (before a user sends the second Ctrl+C).

Coincidentally, a motivating example came up during hack week: #210. pulumi-pulumiservice has a non-atomic create operation for Team, so it needs to handle created but uninitialized resources. Not responding instantly to Cancel increases the odds that we will leak a Team in the service, which we will always do if the provider is killed after the team has been created but before the Create call returns.

@t0yv0
Copy link
Member

t0yv0 commented Apr 5, 2024

Right, right I'm with you that killing the provider is not always great answer, but if you're saying cancel cannot block, what are you saying will happen in the service provider in that scenario? Are you saying that Cancel handler will issue cancellation over context.Context object which will make Team.Create abort work but respond to the engine with "InitFailed" so that can be saved in state? I mean this:

// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.

I haven't seen the code referenced that actually responds to the cancelled context, but skimming the docs just now it might be implicit in lower level networking code... So any kind of request handlers will stop waiting.

So this makes sense if we have some form of error recovery, like saving errors into state with &rpc.ErrorResourceInitFailed (which I still would love a quick page on what that does precisely). If you want to speed up this error recovery this makes total sense. If it was just injecting unrecoverable errors it didn't make sense to me as it seemed to be swapping one way to crash for another way to crash.

@iwahbe
Copy link
Member Author

iwahbe commented Apr 5, 2024

Are you saying that Cancel handler will issue cancellation over context.Context object which will make Team.Create abort work but respond to the engine with "InitFailed" so that can be saved in state?

Yes! That's exactly what I'm saying.

I haven't seen the code referenced that actually responds to the cancelled context, but skimming the docs just now it might be implicit in lower level networking code... So any kind of request handlers will stop waiting.

Yes. Any high quality library with ongoing operations should cancel immediately. This includes go's standard library, for both network requests and subprocesses. Most providers (pulumi-command and pulumi-pulumiservice included) will get instant cancelation for free.

@iwahbe iwahbe added kind/design An engineering design doc, usually part of an Epic and removed kind/enhancement Improvements or new features labels May 23, 2024
@iwahbe iwahbe added the 1.0.0-blocker An issue that must be resolved before 1.0.0 label Aug 20, 2024
@iwahbe iwahbe added the resolution/by-design This issue won't be fixed because the functionality is working as designed label Sep 23, 2024
@iwahbe iwahbe self-assigned this Sep 23, 2024
@iwahbe
Copy link
Member Author

iwahbe commented Sep 23, 2024

This hasn't come up again, and I think the design is pretty solid.

I definitely think that reducing the scope of the cancel wrapper would hurt the default experience. Until and unless the Pulumi provider protocol allows providers to block a cancel operation, we need to default to the safest Cancel behavior, which is graceful termination of all ongoing tasks.

@iwahbe iwahbe closed this as completed Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0.0-blocker An issue that must be resolved before 1.0.0 kind/design An engineering design doc, usually part of an Epic resolution/by-design This issue won't be fixed because the functionality is working as designed
Projects
None yet
Development

No branches or pull requests

3 participants