Fix cancelation panic #207

Merged: 7 commits merged into main on Apr 1, 2024
Conversation

@iwahbe (Member) commented Mar 28, 2024

Fixes #186

This PR is best reviewed commit by commit:

  • 0ecbbcd is a pure refactor to improve the readability of the cancel.Wrap method.
  • 997d21b replaces inOutCache with evict.Pool: a data structure better suited to the purpose. cancel uses evict.Pool to keep track of the set of current requests, ensuring that the context associated with each request is canceled if any one of three events happens (see the sketch after this list):
    • The request returns.
    • The request times out (only for Create, Update and Delete).
    • The provider receives a Cancel request.
  • 6d6b8aa improves the integration testing around the cancel middleware, especially under heavy concurrency. I'm pleased to report that 6d6b8aa without 997d21b regularly panics, showing that the improved testing does cover Cancellation panic #186.
  • 9a69c1f sets the -race flag for tests. This would have also caught Cancellation panic #186.
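For readers following along, here is a rough sketch of the idea in the second bullet: a pool tracks the cancel function of every in-flight request so the associated context is canceled on any of the three events. The type and method names below are illustrative only, not the actual evict.Pool API.

    package cancelsketch

    import (
        "context"
        "sync"
        "time"
    )

    // requestPool tracks the cancel function of every in-flight request so the
    // associated context can be canceled when the request returns, when it
    // times out, or when the provider receives a Cancel call.
    type requestPool struct {
        mu      sync.Mutex
        closed  bool // set once Cancel has been received
        next    int
        cancels map[int]context.CancelFunc
    }

    // track registers a request. The returned cleanup must be called when the
    // request returns, covering the first of the three events.
    func (p *requestPool) track(ctx context.Context, timeout time.Duration) (context.Context, func()) {
        var cancel context.CancelFunc
        if timeout > 0 {
            // Second event: Create, Update, and Delete carry a timeout.
            ctx, cancel = context.WithTimeout(ctx, timeout)
        } else {
            ctx, cancel = context.WithCancel(ctx)
        }

        p.mu.Lock()
        defer p.mu.Unlock()
        if p.cancels == nil {
            p.cancels = map[int]context.CancelFunc{}
        }
        if p.closed {
            cancel() // Cancel already arrived; new work starts canceled.
            return ctx, func() {}
        }
        id := p.next
        p.next++
        p.cancels[id] = cancel

        return ctx, func() {
            p.mu.Lock()
            delete(p.cancels, id)
            p.mu.Unlock()
            cancel()
        }
    }

    // cancelAll covers the third event: the provider received a Cancel
    // request, so every outstanding request context is canceled.
    func (p *requestPool) cancelAll() {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.closed = true
        for id, cancel := range p.cancels {
            cancel()
            delete(p.cancels, id)
        }
    }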

@iwahbe iwahbe requested review from blampe and a team March 28, 2024 14:46
@iwahbe iwahbe self-assigned this Mar 28, 2024
@blampe (Contributor) commented Mar 28, 2024

I'm still having trouble understanding why we need concurrency management here. Each request is handled in its own goroutine with its own context, so what is there to coordinate? I understand why we need some middleware to translate context.Context into provider.Context (#159), but it seems like the ctx.WithDeadline() (or p.CtxWithTimeout) that we pass down should be sufficient?

The provider receives a Cancel request.

It seems like this is primarily why we're trying to coordinate things across requests? If I understand, we get this Cancel and then try to also cancel in-progress requests. Setting aside whether this is the right behavior (I don't think it is), calling Stop on the server (which already has a global view of in-flight requests) should accomplish the same goal. That is, each request would see a message on ctx.Done(). If that's not the behavior we see, then that points towards a bug somewhere in how we propagate context.

Re: behavior, I don't think the Cancel handler's default behavior should impact other requests. If it should, I think that's a decision (and implementation) best left to the provider author. I would expect something more like: the first ctrl-C invokes Cancel, the server starts a graceful stop, and a second ctrl-C will force close the client's connection (and therefore cancel everything outstanding on the server).
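For reference, this is roughly what seeing a message on ctx.Done() looks like inside a long-running handler; waitForReady and poll are hypothetical names, not code from this repo.

    package cancelsketch

    import (
        "context"
        "time"
    )

    // waitForReady polls a resource until it is ready or until the handler's
    // context is canceled (by a deadline, a server Stop, or anything else that
    // closes ctx.Done()).
    func waitForReady(ctx context.Context, poll func() (done bool, err error)) error {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                // Cancellation observed; report why and stop work.
                return ctx.Err()
            case <-ticker.C:
                if done, err := poll(); err != nil || done {
                    return err
                }
            }
        }
    }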

@iwahbe (Member, Author) commented Mar 28, 2024

I'm still having trouble understanding why we need concurrency management here. Each request is handled in its own goroutine with its own context, so what is there to coordinate? I understand why we need some middleware to translate context.Context into provider.Context (#159), but it seems like the ctx.WithDeadline() (or p.CtxWithTimeout) that we pass down should be sufficient?

That sits at the root of the server stack, and is unrelated to this middleware block. #159 is still the right thing to do, but it doesn't factor in here.

The provider receives a Cancel request.

It seems like this is primarily why we're trying to coordinate things across requests?

Yes. That is the only reason we need to deal with concurrency here.

Calling GracefulStop blocks the server from accepting new connections, but doesn't do anything to terminate existing connections. GracefulStop doesn't cancel the context on existing runs. Calling Stop does cancel the contexts associated with gRPC connections, but it also kills those connections, preventing the provider from communicating any information back to the engine.

The documentation for Cancel says:

    // Cancel signals the provider to gracefully shut down and abort any ongoing resource operations.
    // Operations aborted in this way will return an error (e.g., `Update` and `Create` will either return a
    // creation error or an initialization error). Since Cancel is advisory and non-blocking, it is up
    // to the host to decide how long to wait after Cancel is called before (e.g.)
    // hard-closing any gRPC connection.
    rpc Cancel(google.protobuf.Empty) returns (google.protobuf.Empty) {}

To do this, we need a middle ground between Stop and GracefulStop: we need to communicate to each resource that it should report progress and exit, without slamming shut the gRPC connection. AFAIK the convention in Go to inform a process that it should clean up and go home is to cancel its context.Context.
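A minimal sketch of that middle ground, assuming one shared parent context per provider; the names below are illustrative, not the middleware's actual internals.

    package cancelsketch

    import "context"

    // cancelState holds one parent context shared by every request; the Cancel
    // handler trips it instead of tearing down the gRPC server.
    type cancelState struct {
        parent context.Context
        cancel context.CancelFunc
    }

    func newCancelState() *cancelState {
        ctx, cancel := context.WithCancel(context.Background())
        return &cancelState{parent: ctx, cancel: cancel}
    }

    // forRequest derives the context handed to an individual Create/Update/Delete.
    // It is canceled when the request's own context ends or when Cancel fires.
    func (s *cancelState) forRequest(ctx context.Context) (context.Context, context.CancelFunc) {
        merged, cancel := context.WithCancel(ctx)
        go func() {
            // Watch both the request's own lifetime and the shared Cancel signal.
            select {
            case <-s.parent.Done():
                cancel()
            case <-merged.Done():
            }
        }()
        return merged, cancel
    }

    // onCancelRPC is what the Cancel handler would call: in-flight contexts are
    // canceled, but the connection stays open so each request can still report
    // progress or an error back to the engine.
    func (s *cancelState) onCancelRPC() { s.cancel() }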

Re: behavior, I don't think the Cancel handler's default behavior should impact other requests. If it should, I think that's a decision (and implementation) best left to the provider author. I would expect something more like: the first ctrl-C invokes Cancel, the server starts a graceful stop, and a second ctrl-C will force close the client's connection (and therefore cancel everything outstanding on the server).

The first ctrl-C sends the Cancel request to the provider. The second ctrl-C has the engine kill the provider. As a provider author, we just get the one Cancel. As I read the description of the Cancel call, it seems pretty clear that it should impact ongoing requests.

@t0yv0 (Member) commented Mar 29, 2024

I tried reading this, but I'm also not sure what behavior we want, and I'd like to understand that before getting down to how it's implemented. If my review's needed here, I'd appreciate a quick sync to figure this out. There's a slightly related grey area for me where bridged providers try to handle graceful termination, so it would be super helpful to clarify what the desired behavior is when handling graceful provider termination, and what the idiomatic Go expression of that is.

Naively, I'm expecting something along the lines of the gRPC server handling everything necessary for the Contexts that handlers receive, but then us needing to tie a request for graceful termination from the Engine to some form of message to the gRPC server that would instruct it to cancel all existing context objects and wait for the requests to finish processing. I'd expect all of this to be available at that layer, not at the logical layer as it is here.

Comment on lines +15 to +26
// The `cancel` package provides a middle-ware that ties the Cancel gRPC call from Pulumi
// to Go's `context.Context` cancellation system.
//
// Wrapping a provider in `cancel.Wrap` ensures 2 things:
//
// 1. When a resource operation times out, the associated context is canceled.
//
// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.
//
// A `cancel.Wrap`ed provider will still call the `Cancel` method on the underlying
// provider. If NotImplemented is returned, it will be swallowed.
@iwahbe (Member Author) commented:

I added some docs clarifying what cancel currently does. This behavior is unchanged before and after this PR. The PR only prevents panics on concurrent creations and cancelations.

Since this PR fixes a P1 (panic) and doesn't change the design of cancel.Wrap, I suggest that we merge as is unless there are suggestions on how to better implement the existing semantics.

I can tell that there is disagreement on the desired behavior of cancel.Wrap. I'm happy to have that discussion, but I think it should be a separate discussion, outside the scope of this PR.

CC @blampe @t0yv0
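For context, a minimal usage sketch of the behavior the quoted doc comment describes; the import paths and the exact Wrap signature here are assumptions based on that doc comment, not verified against the package.

    package example

    import (
        p "github.com/pulumi/pulumi-go-provider"
        "github.com/pulumi/pulumi-go-provider/middleware/cancel"
    )

    // buildProvider wraps an existing provider so that resource-operation
    // timeouts cancel the request context and a Cancel RPC cancels every
    // outstanding request context. The inner provider's own Cancel is still
    // called; a NotImplemented error from it is swallowed.
    func buildProvider(inner p.Provider) p.Provider {
        return cancel.Wrap(inner)
    }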

@iwahbe (Member Author) commented:

I opened #208 to track a design discussion on what cancel should do.

@blampe (Contributor) commented:

Sounds good to me!

@blampe (Contributor) commented Mar 29, 2024

Yes. That is the only reason we need to deal with concurrency here.

Calling Stop does cancel the contexts associated with gRPC connections, but it also kills those connections, preventing the provider from communicating any information back to the engine.

Gotcha -- this is the crux of it. You want to cancel the requests but still let them log or return to the engine.

The documentation for Cancel says:

    // Cancel signals the provider to gracefully shut down and abort any ongoing resource operations.
    // Operations aborted in this way will return an error (e.g., `Update` and `Create` will either return a
    // creation error or an initialization error). Since Cancel is advisory and non-blocking, it is up
    // to the host to decide how long to wait after Cancel is called before (e.g.)
    // hard-closing any gRPC connection.
    rpc Cancel(google.protobuf.Empty) returns (google.protobuf.Empty) {}

To do this, we need a middle ground between Stop and GracefulStop: we need to communicate to each resource that it should report progress and exit, without slamming shut the gRPC connection. AFAIK the convention in Go to inform a process that it should clean up and go home is to cancel its context.Context.

The first ctrl-C sends the Cancel request to the provider. The second ctrl-C has the engine kill the provider. As a provider author, we just get the one Cancel. As I read the description of the Cancel call, it seems pretty clear that it should impact ongoing requests.

Interesting, I guess I interpreted this differently. The signal/advisory parts make it seem like more of a courtesy call, and the "Operations aborted in this way..." part explains how the engine will interpret operations aborted by the provider (if it decides to abort anything).

In any case this is probably one of those under-specified corners of behavior. I'm curious what we do in the bridge -- if this is already the norm then I don't have any concerns.

Naively, I'm expecting something along the lines of the gRPC server handling everything necessary for the Contexts that handlers receive, but then us needing to tie a request for graceful termination from the Engine to some form of message to the gRPC server that would instruct it to cancel all existing context objects and wait for the requests to finish processing. I'd expect all of this to be available at that layer, not at the logical layer as it is here.

Same. This feels like it could be handled by an interceptor which the server can broadcast a shutting-down signal to.
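A rough sketch of that interceptor approach, using the standard google.golang.org/grpc unary interceptor hook; the shutdown channel and the surrounding wiring are hypothetical.

    package example

    import (
        "context"

        "google.golang.org/grpc"
    )

    // cancelOnShutdown derives every handler's context from the incoming one
    // and cancels it as soon as the server broadcasts shutdown (e.g. after the
    // engine's Cancel call).
    func cancelOnShutdown(shutdown <-chan struct{}) grpc.UnaryServerInterceptor {
        return func(
            ctx context.Context,
            req interface{},
            info *grpc.UnaryServerInfo,
            handler grpc.UnaryHandler,
        ) (interface{}, error) {
            ctx, cancel := context.WithCancel(ctx)
            defer cancel()
            go func() {
                select {
                case <-shutdown:
                    cancel()
                case <-ctx.Done():
                }
            }()
            return handler(ctx, req)
        }
    }

The server would register this via grpc.ChainUnaryInterceptor when constructing the grpc.Server, and its Cancel handler would close the shutdown channel.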

This is accomplished by replacing a linear search of `c.entries` for empty cells with a
maintained list of empty cells.
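Purely as illustration of that change, a sketch of a slot pool that maintains a free list instead of scanning for empty cells; the names are made up and are not the actual evict package.

    package example

    // pool stores values in reusable cells. Instead of linearly scanning
    // entries for an empty cell on every insert, it keeps the indexes of empty
    // cells on a free list.
    type pool[T any] struct {
        entries []cell[T]
        free    []int // indexes of currently-empty cells
    }

    type cell[T any] struct {
        value T
        used  bool
    }

    // insert places v in an empty cell, reusing one from the free list in O(1)
    // when available.
    func (p *pool[T]) insert(v T) int {
        var i int
        if n := len(p.free); n > 0 {
            i = p.free[n-1]
            p.free = p.free[:n-1]
        } else {
            i = len(p.entries)
            p.entries = append(p.entries, cell[T]{})
        }
        p.entries[i] = cell[T]{value: v, used: true}
        return i
    }

    // remove empties a cell and records its index for reuse.
    func (p *pool[T]) remove(i int) {
        p.entries[i] = cell[T]{}
        p.free = append(p.free, i)
    }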
@iwahbe iwahbe merged commit e49130e into main Apr 1, 2024
6 checks passed
@iwahbe iwahbe deleted the iwahbe/186/fix-cancelation-panic branch April 1, 2024 09:48