Fix cancelation panic #207

Merged: 7 commits merged into main on Apr 1, 2024
Conversation

@iwahbe (Member) commented Mar 28, 2024

Fixes #186

This PR is best reviewed commit by commit:

  • 0ecbbcd is a pure refactor to improve the readability of the cancel.Wrap method.
  • 997d21b replaces inOutCache with evict.Pool: a data structure better suited to the purpose. cancel uses evict.Pool to keep track of the set of current requests, ensuring that the context associated with each request is canceled if any one of three events happens (see the sketch after this list):
    • The request returns.
    • The request times out (only for Create, Update and Delete).
    • The provider receives a Cancel request.
  • 6d6b8aa improves the integration testing around the cancel middleware, especially under heavy concurrency. I'm pleased to report that 6d6b8aa without 997d21b regularly panics, showing that the improved testing does cover Cancellation panic #186.
  • 9a69c1f sets the -race flag for tests. This would have also caught Cancellation panic #186.
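For readers following along, here is a rough sketch of the idea in the second bullet: a pool tracks the cancel function of every in-flight request so the associated context is canceled on any of the three events. The type and method names below are illustrative only, not the actual evict.Pool API.

    package cancelsketch

    import (
        "context"
        "sync"
        "time"
    )

    // requestPool tracks the cancel function of every in-flight request so the
    // associated context can be canceled when the request returns, when it
    // times out, or when the provider receives a Cancel call.
    type requestPool struct {
        mu      sync.Mutex
        closed  bool // set once Cancel has been received
        next    int
        cancels map[int]context.CancelFunc
    }

    // track registers a request. The returned cleanup must be called when the
    // request returns, covering the first of the three events.
    func (p *requestPool) track(ctx context.Context, timeout time.Duration) (context.Context, func()) {
        var cancel context.CancelFunc
        if timeout > 0 {
            // Second event: Create, Update, and Delete carry a timeout.
            ctx, cancel = context.WithTimeout(ctx, timeout)
        } else {
            ctx, cancel = context.WithCancel(ctx)
        }

        p.mu.Lock()
        defer p.mu.Unlock()
        if p.cancels == nil {
            p.cancels = map[int]context.CancelFunc{}
        }
        if p.closed {
            cancel() // Cancel already arrived; new work starts canceled.
            return ctx, func() {}
        }
        id := p.next
        p.next++
        p.cancels[id] = cancel

        return ctx, func() {
            p.mu.Lock()
            delete(p.cancels, id)
            p.mu.Unlock()
            cancel()
        }
    }

    // cancelAll covers the third event: the provider received a Cancel
    // request, so every outstanding request context is canceled.
    func (p *requestPool) cancelAll() {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.closed = true
        for id, cancel := range p.cancels {
            cancel()
            delete(p.cancels, id)
        }
    }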

@iwahbe iwahbe requested review from blampe and a team March 28, 2024 14:46
@iwahbe iwahbe self-assigned this Mar 28, 2024
@blampe (Contributor) commented Mar 28, 2024

I'm still having trouble understanding why we need concurrency management here. Each request is handled in its own goroutine with its own context, so what is there to coordinate? I understand why we need some middleware to translate context.Context into provider.Context (#159), but it seems like the ctx.WithDeadline() (or p.CtxWithTimeout) that we pass down should be sufficient?

The provider receives a Cancel request.

It seems like this is primarily why we're trying to coordinate things across requests? If I understand, we get this Cancel and then try to also cancel in-progress requests. Setting aside whether this is the right behavior (I don't think it is), calling Stop on the server (which already has a global view of in-flight requests) should accomplish the same goal. That is, each request would see a message on ctx.Done(). If that's not the behavior we see, then that points towards a bug somewhere in how we propagate context.

Re: behavior, I don't think the Cancel handler's default behavior should impact other requests. If it should, I think that's a decision (and implementation) best left to the provider author. I would expect something more like: the first ctrl-C invokes Cancel, the server starts a graceful stop, and a second ctrl-C will force close the client's connection (and therefore cancel everything outstanding on the server).
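For reference, this is roughly what seeing a message on ctx.Done() looks like inside a long-running handler; waitForReady and poll are hypothetical names, not code from this repo.

    package cancelsketch

    import (
        "context"
        "time"
    )

    // waitForReady polls a resource until it is ready or until the handler's
    // context is canceled (by a deadline, a server Stop, or anything else that
    // closes ctx.Done()).
    func waitForReady(ctx context.Context, poll func() (done bool, err error)) error {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                // Cancellation observed; report why and stop work.
                return ctx.Err()
            case <-ticker.C:
                if done, err := poll(); err != nil || done {
                    return err
                }
            }
        }
    }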

@iwahbe (Member, Author) commented Mar 28, 2024

I'm still having trouble understanding why we need concurrency management here. Each request is handled in its own goroutine with its own context, so what is there to coordinate? I understand why we need some middleware to translate context.Context into provider.Context (#159), but it seems like the ctx.WithDeadline() (or p.CtxWithTimeout) that we pass down should be sufficient?

That sits at the root of the server stack, and is unrelated to this middleware block. #159 is still the right thing to do, but it doesn't factor in here.

The provider receives a Cancel request.

It seems like this is primarily why we're trying to coordinate things across requests?

Yes. That is the only reason we need to deal with concurrency here.

Calling GracefulStop blocks the server from accepting new connections, but doesn't do anything to terminate existing connections. GracefulStop doesn't cancel the context on existing runs. Calling Stop does cancel the contexts associated with gRPC connections, but it also kills those connections, preventing the provider from communicating any information back to the engine.

The documentation for Cancel says:

    // Cancel signals the provider to gracefully shut down and abort any ongoing resource operations.
    // Operations aborted in this way will return an error (e.g., `Update` and `Create` will either return a
    // creation error or an initialization error). Since Cancel is advisory and non-blocking, it is up
    // to the host to decide how long to wait after Cancel is called before (e.g.)
    // hard-closing any gRPC connection.
    rpc Cancel(google.protobuf.Empty) returns (google.protobuf.Empty) {}

To do this, we need a middle ground between Stop and GracefulStop: we need to communicate to each resource that it should report progress and exit, without slamming shut the gRPC connection. AFAIK the convention in Go to inform a process that it should clean up and go home is to cancel its context.Context.
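A minimal sketch of that middle ground, assuming one shared parent context per provider; the names below are illustrative, not the middleware's actual internals.

    package cancelsketch

    import "context"

    // cancelState holds one parent context shared by every request; the Cancel
    // handler trips it instead of tearing down the gRPC server.
    type cancelState struct {
        parent context.Context
        cancel context.CancelFunc
    }

    func newCancelState() *cancelState {
        ctx, cancel := context.WithCancel(context.Background())
        return &cancelState{parent: ctx, cancel: cancel}
    }

    // forRequest derives the context handed to an individual Create/Update/Delete.
    // It is canceled when the request's own context ends or when Cancel fires.
    func (s *cancelState) forRequest(ctx context.Context) (context.Context, context.CancelFunc) {
        merged, cancel := context.WithCancel(ctx)
        go func() {
            // Watch both the request's own lifetime and the shared Cancel signal.
            select {
            case <-s.parent.Done():
                cancel()
            case <-merged.Done():
            }
        }()
        return merged, cancel
    }

    // onCancelRPC is what the Cancel handler would call: in-flight contexts are
    // canceled, but the connection stays open so each request can still report
    // progress or an error back to the engine.
    func (s *cancelState) onCancelRPC() { s.cancel() }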

Re: behavior, I don't think the Cancel handler's default behavior should impact other requests. If it should, I think that's a decision (and implementation) best left to the provider author. I would expect something more like: the first ctrl-C invokes Cancel, the server starts a graceful stop, and a second ctrl-C will force close the client's connection (and therefore cancel everything outstanding on the server).

The first ctrl-C sends the Cancel request to the provider. The second ctrl-C has the engine kill the provider. As a provider author, we just get the one Cancel. As I read the description of the Cancel call, it seems pretty clear that it should impact ongoing requests.

@t0yv0 (Member) commented Mar 29, 2024

I tried reading this, but I'm also not sure what behavior we want, and I'd like to understand that before getting down to how it's implemented. If my review's needed here, I'd appreciate a quick sync to figure this out. There's a slightly related grey area for me where bridged providers try to handle graceful termination, so it would be super helpful to clarify what the desired behavior is when handling graceful provider termination, and what the idiomatic Go expression of that is.

Naively, I'm expecting something along the lines of the gRPC server handling everything necessary for the Contexts that handlers receive, but then us needing to tie a request for graceful termination from the Engine to some form of message to the gRPC server that would instruct it to cancel all existing context objects and wait for the requests to finish processing. I'd expect all of this to be available at that layer, not at the logical layer as it is here.

Comment on lines +15 to +26
// The `cancel` package provides a middle-ware that ties the Cancel gRPC call from Pulumi
// to Go's `context.Context` cancellation system.
//
// Wrapping a provider in `cancel.Wrap` ensures 2 things:
//
// 1. When a resource operation times out, the associated context is canceled.
//
// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.
//
// A `cancel.Wrap`ed provider will still call the `Cancel` method on the underlying
// provider. If NotImplemented is returned, it will be swallowed.
@iwahbe (Member Author) commented:

I added some docs clarifying what cancel currently does. This behavior is unchanged before and after this PR. The PR only prevents panics on concurrent creations and cancelations.

Since this PR fixes a P1 (panic) and doesn't change the design of cancel.Wrap, I suggest that we merge as is unless there are suggestions on how to better implement the existing semantics.

I can tell that there is disagreement on the desired behavior of cancel.Wrap. I'm happy to have that discussion, but I think it should be a separate discussion, outside the scope of this PR.

CC @blampe @t0yv0
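For context, a minimal usage sketch of the behavior the quoted doc comment describes; the import paths and the exact Wrap signature here are assumptions based on that doc comment, not verified against the package.

    package example

    import (
        p "github.com/pulumi/pulumi-go-provider"
        "github.com/pulumi/pulumi-go-provider/middleware/cancel"
    )

    // buildProvider wraps an existing provider so that resource-operation
    // timeouts cancel the request context and a Cancel RPC cancels every
    // outstanding request context. The inner provider's own Cancel is still
    // called; a NotImplemented error from it is swallowed.
    func buildProvider(inner p.Provider) p.Provider {
        return cancel.Wrap(inner)
    }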

@iwahbe (Member Author) commented:

I opened #208 to track a design discussion on what cancel should do.

@blampe (Contributor) commented:

Sounds good to me!

@blampe (Contributor) commented Mar 29, 2024

Yes. That is the only reason we need to deal with concurrency here.

Calling Stop does cancel the contexts associated with gRPC connections, but it also kills those connections, preventing the provider from communicating any information back to the engine.

Gotcha -- this is the crux of it. You want to cancel the requests but still let them log or return to the engine.

The documentation for Cancel says:

    // Cancel signals the provider to gracefully shut down and abort any ongoing resource operations.
    // Operations aborted in this way will return an error (e.g., `Update` and `Create` will either return a
    // creation error or an initialization error). Since Cancel is advisory and non-blocking, it is up
    // to the host to decide how long to wait after Cancel is called before (e.g.)
    // hard-closing any gRPC connection.
    rpc Cancel(google.protobuf.Empty) returns (google.protobuf.Empty) {}

To do this, we need a middle ground between Stop and GracefulStop: we need to communicate to each resource that it should report progress and exit, without slamming shut the gRPC connection. AFAIK the convention in Go to inform a process that it should clean up and go home is to cancel its context.Context.

The first ctrl-C sends the Cancel request to the provider. The second ctrl-C has the engine kill the provider. As a provider author, we just get the one Cancel. As I read the description of the Cancel call, it seems pretty clear that it should impact ongoing requests.

Interesting, I guess I interpreted this differently. The signal/advisory parts make it seem like more of a courtesy call, and the "Operations aborted in this way..." part explains how the engine will interpret operations aborted by the provider (if it decides to abort anything).

In any case this is probably one of those under-specified corners of behavior. I'm curious what we do in the bridge -- if this is already the norm then I don't have any concerns.

Naively, I'm expecting something along the lines of the gRPC server handling everything necessary for the Contexts that handlers receive, but then us needing to tie a request for graceful termination from the Engine to some form of message to the gRPC server that would instruct it to cancel all existing context objects and wait for the requests to finish processing. I'd expect all of this to be available at that layer, not at the logical layer as it is here.

Same. This feels like it could be handled by an interceptor which the server can broadcast a shutting-down signal to.
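A rough sketch of that interceptor approach, using the standard google.golang.org/grpc unary interceptor hook; the shutdown channel and the surrounding wiring are hypothetical.

    package example

    import (
        "context"

        "google.golang.org/grpc"
    )

    // cancelOnShutdown derives every handler's context from the incoming one
    // and cancels it as soon as the server broadcasts shutdown (e.g. after the
    // engine's Cancel call).
    func cancelOnShutdown(shutdown <-chan struct{}) grpc.UnaryServerInterceptor {
        return func(
            ctx context.Context,
            req interface{},
            info *grpc.UnaryServerInfo,
            handler grpc.UnaryHandler,
        ) (interface{}, error) {
            ctx, cancel := context.WithCancel(ctx)
            defer cancel()
            go func() {
                select {
                case <-shutdown:
                    cancel()
                case <-ctx.Done():
                }
            }()
            return handler(ctx, req)
        }
    }

The server would register this via grpc.ChainUnaryInterceptor when constructing the grpc.Server, and its Cancel handler would close the shutdown channel.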

This is accomplished by replacing a linear search of `c.entries` for empty cells with a
maintained list of empty cells.
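Purely as illustration of that change, a sketch of a slot pool that maintains a free list instead of scanning for empty cells; the names are made up and are not the actual evict package.

    package example

    // pool stores values in reusable cells. Instead of linearly scanning
    // entries for an empty cell on every insert, it keeps the indexes of empty
    // cells on a free list.
    type pool[T any] struct {
        entries []cell[T]
        free    []int // indexes of currently-empty cells
    }

    type cell[T any] struct {
        value T
        used  bool
    }

    // insert places v in an empty cell, reusing one from the free list in O(1)
    // when available.
    func (p *pool[T]) insert(v T) int {
        var i int
        if n := len(p.free); n > 0 {
            i = p.free[n-1]
            p.free = p.free[:n-1]
        } else {
            i = len(p.entries)
            p.entries = append(p.entries, cell[T]{})
        }
        p.entries[i] = cell[T]{value: v, used: true}
        return i
    }

    // remove empties a cell and records its index for reuse.
    func (p *pool[T]) remove(i int) {
        p.entries[i] = cell[T]{}
        p.free = append(p.free, i)
    }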
@iwahbe iwahbe merged commit e49130e into main Apr 1, 2024
6 checks passed
@iwahbe iwahbe deleted the iwahbe/186/fix-cancelation-panic branch April 1, 2024 09:48