
test_instance_create_timeout_while_creating_zone is flaky #6197

Closed
bnaecker opened this issue Jul 31, 2024 · 12 comments · Fixed by #6238
Assignees: bnaecker
Labels: Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

@bnaecker (Collaborator)

The test test_instance_create_timeout_while_creating_zone attempts to verify what happens when zone boot times out while creating an instance. Prior to b74d691, the test was inadvertently relying on a DNS resolution request that would never succeed in order to "time out" the zone boot. In that commit, we removed that DNS resolution, revealing the problem in the test.

However, the test remains flaky. It relies on Tokio's time facilities to pause and advance time far enough that we time out the request to create an instance, but not so far that the zone actually boots. That kind of control of time is always finicky, since time often needs to be advanced in small jumps to allow the runtime to schedule and drive other tasks in the meantime. In this case, we can apparently just never call the MockZones::boot() method, presumably because the test succeeds and that task is torn down before it's called.

We should probably make the test work better by advancing time in small jumps, well beyond the expected timeout for the zone boot call itself.

Originally posted by @hawkw in #5749 (comment)

bnaecker self-assigned this Jul 31, 2024
@bnaecker (Collaborator, Author) commented Aug 2, 2024

I've been looking at this again, because I'm touching these tests for something unrelated. I think it's not possible to do what we want in this test.

The intent behind the test is this:

  1. Make a mock expectation with a synchronous closure that waits for a long time; call that duration boot_timeout.
  2. Pause Tokio time.
  3. Try to set an instance running.
  4. Apply a timeout to that future that is shorter than boot_timeout; call it instance_timeout.
  5. Step time beyond instance_timeout, but not beyond boot_timeout.
  6. Observe that the timeout actually does fire.
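
Putting those six steps together, the intended structure would look roughly like the sketch below. Everything here is a hypothetical stand-in -- the Zones trait, MockZones, and create_instance() -- not the real sled-agent code, and as the rest of this thread explains, the structure cannot work reliably:

```rust
use std::time::Duration;
use mockall::automock;

// Hypothetical stand-in for the real zones interface. `boot` is an
// associated function, so mockall generates a `boot_context()` handle.
#[automock]
trait Zones {
    fn boot(zone: &str) -> Result<(), String>;
}

// Hypothetical stand-in for the code under test, which calls the
// (mocked) zone-boot method from async context.
async fn create_instance() -> Result<(), String> {
    MockZones::boot("oxz_instance")
}

// Requires tokio's `test-util` feature for pause()/advance().
#[tokio::test]
async fn instance_create_times_out_while_booting() {
    const BOOT_TIMEOUT: Duration = Duration::from_secs(60);
    const INSTANCE_TIMEOUT: Duration = Duration::from_secs(10);

    // (1) A mock expectation whose synchronous closure waits a long
    // time. This is the crux of the problem: thread::sleep waits for
    // *wall-clock* time and blocks the runtime's only thread.
    let ctx = MockZones::boot_context();
    ctx.expect().returning(|_| {
        std::thread::sleep(BOOT_TIMEOUT);
        Ok(())
    });

    // (2) Pause Tokio's clock so the test can step it manually.
    tokio::time::pause();

    // (3, 4) Race instance creation against the shorter timeout.
    let attempt = tokio::time::timeout(INSTANCE_TIMEOUT, create_instance());

    // (5) Step past INSTANCE_TIMEOUT, but not past BOOT_TIMEOUT.
    tokio::time::advance(INSTANCE_TIMEOUT + Duration::from_secs(1)).await;

    // (6) The attempt should report a timeout. In practice this races:
    // depending on polling order the mock closure may never be called,
    // or may block the thread for 60 real seconds -- exactly the flake
    // described above.
    assert!(attempt.await.is_err());
}
```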

The problem is that this bridges async and sync code. The mockall closure must be synchronous. There are mechanisms for waiting on events or timeouts in synchronous code, such as a std channel or plain std::thread::sleep(). However! We're also trying to step time from the test body itself, and drive the test to completion.

If we call the mock closure at all before the test completes, it will block until whatever waiting mechanism we use completes. The closure is synchronous, and without an await point there is no way to yield back to the Tokio runtime to drive the rest of the tasks to completion. So what we want is a way to delay completing that closure, but that's not possible without indefinitely stalling the runtime itself.
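
That starvation is easy to demonstrate in isolation. Here is a self-contained toy (not from the test itself): a synchronous sleep inside a task holds the current-thread runtime's only thread, and paused Tokio time does nothing to shorten the wait:

```rust
use std::time::{Duration, Instant};

// Current-thread runtime; pause() needs tokio's `test-util` feature.
#[tokio::test]
async fn sync_block_starves_the_runtime() {
    tokio::time::pause();
    let start = Instant::now();
    let task = tokio::spawn(async {
        // A synchronous wait, like the mock closure's: it holds the
        // runtime's only thread hostage for real wall-clock time, and
        // there is no await point at which another task could run.
        std::thread::sleep(Duration::from_secs(2));
    });
    task.await.unwrap();
    // Pausing Tokio's clock did not help; two real seconds elapsed.
    assert!(start.elapsed() >= Duration::from_secs(2));
}
```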

sunshowers added the Test Flake label Aug 3, 2024
@sunshowers (Contributor) commented Aug 5, 2024

This flake seems to be pretty common in practice. Would it be worth adding retries at the nextest level (if the fix may take a while to develop)?

@bnaecker (Collaborator, Author) commented Aug 5, 2024

Unless I'm missing something, I don't believe it's possible to write this test the way we want. I don't see how one can block the mock zone-boot method without also blocking the single-threaded Tokio runtime. That is, either the test just takes the full time we are pretending zone boot takes (60s), or we fundamentally restructure the test or remove it. I personally feel the latter is the most expedient option, but I'd love to learn how we can keep the test.

@sunshowers (Contributor) commented Aug 5, 2024

Hmm, could you do something like injecting a oneshot channel to wait on? (It also might be worth using a multithreaded Tokio runtime for this test -- we do it for a few others.)

@sunshowers (Contributor)

Hmm, a multithreaded runtime does mean we lose the ability to use Tokio's pause/resume time facilities.

I guess the only real way to do this would be to pass the source of time down as a dependency. If that's too much work, it's probably just worth removing the test.
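
For reference, a minimal sketch of what "passing the source of time down" could look like; Clock and TestClock are hypothetical names, not anything that exists in the tree. The code under test would read time through the trait, and the test would hold a TestClock and advance it explicitly:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// The code under test asks a Clock for the time instead of consulting
// Tokio's (global) clock directly.
pub trait Clock: Send + Sync {
    fn now(&self) -> Instant;
}

// A test implementation whose notion of "now" only moves when the test
// says so -- no global time manipulation required.
pub struct TestClock(Mutex<Instant>);

impl TestClock {
    pub fn new() -> Self {
        Self(Mutex::new(Instant::now()))
    }

    // Move the fake clock forward; only code reading time through the
    // Clock trait observes the jump.
    pub fn advance(&self, by: Duration) {
        *self.0.lock().unwrap() += by;
    }
}

impl Clock for TestClock {
    fn now(&self) -> Instant {
        *self.0.lock().unwrap()
    }
}
```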

@bnaecker (Collaborator, Author) commented Aug 5, 2024

I guess the only real way to do this would be to pass the source of time down as a dependency

I think you mean something like: passing a start time or deadline into the MockZones::boot() closure, and waiting until that's reached, right? If so, I tried that in a previous iteration, and I don't think it can work. That closure has no way to wait for anything without also blocking the Tokio runtime, since it's synchronous. So, if that closure is called at all, it will deadlock the test, since the "outer" code advancing the timer will never run.

If that closure does not wait, then there's really no way for us to make booting appear slow! And since that's the whole point of the test, I feel like that lowers the value of the test.

@sunshowers Am I right in my assessment of the closure? Since it's sync, and we're controlling time in a single-threaded runtime, there's no way for it to "appear slow" without blocking the test entirely.

@sunshowers (Contributor) commented Aug 6, 2024

I think you mean something like: passing a start time or deadline into the MockZones::boot() closure, and waiting until that's reached, right? If so, I tried that in a previous iteration, and I don't think it can work. That closure has no way to wait for anything without also blocking the Tokio runtime, since it's synchronous. So, if that closure is called at all, it will deadlock the test, since the "outer" code advancing the timer will never run.

That's right in the context of a single-threaded runtime -- but using a multithreaded runtime should (I think) be able to work around that.

Let's say you somehow arranged for the MockZones::boot method to have access to a tokio::sync::oneshot::Receiver (e.g. via a task-local variable, or by passing it in somehow -- not quite sure how mockall works).

Then, in MockZones::boot, you could wait on it with blocking_recv.

And in the test itself, you could spin up a task which sends a message via the channel. With a multithreaded runtime, neither should block on the other.

In essence, you're moving away from global time manipulation (which I agree is pretty fragile) to something more controlled. But there's naturally a divergence between the test and the real code, and I don't know whether the test is still useful in this world.

@bnaecker (Collaborator, Author) commented Aug 6, 2024

Let's say you somehow arranged for the MockZones::boot method to have access to a tokio::sync::oneshot::Receiver (e.g. via a task-local variable, or by passing it in somehow -- not quite sure how mockall works).

Then, in MockZones::boot, you could wait on it with blocking_recv.

I'm not sure this will work. blocking_recv() will panic if called from within an async context, right? In this case, that context is the runtime created by #[tokio::test]. That is actually how this test was originally written, and what led me to notice the test failing in the first place!

@lifning seeing as this was your test originally, how strongly do you feel it should be preserved? I don't see a straightforward way to pretend zone boot takes a long time, since we cannot safely block the single-threaded Tokio runtime and still advance time from outside the closure. We could use a multi-threaded runtime, and have the test take the full duration of whatever slow boot process we want to emulate, but we would not be able to advance time in that scenario.

@sunshowers (Contributor) commented Aug 6, 2024

I'm not sure this will work. blocking_recv() will panic if called from within an async context, right?

Ah, not if you call spawn_blocking or block_in_place, though! Which you can do in a multi-threaded runtime context.
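
A self-contained sketch of that pattern (fake_boot is a stand-in for the synchronous mock closure): block_in_place tells the runtime the current worker is about to block, so blocking_recv() is permitted there and the remaining workers keep driving other tasks:

```rust
use tokio::sync::oneshot;

// block_in_place requires the multi-threaded runtime flavor; it panics
// on a current-thread runtime.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn blocking_inside_a_sync_callback() {
    let (tx, rx) = oneshot::channel::<()>();
    // mockall `returning` closures are FnMut, so the receiver is moved
    // in via an Option and consumed on the first call.
    let mut rx = Some(rx);

    // Stand-in for the synchronous mock closure.
    let mut fake_boot = move || {
        tokio::task::block_in_place(|| {
            // Legal here: block_in_place marks this worker as blocked,
            // so blocking_recv() does not panic, and other tasks keep
            // running on the remaining worker threads.
            rx.take().unwrap().blocking_recv().unwrap();
        });
    };

    // The code under test would call the closure from an async task.
    let task = tokio::spawn(async move { fake_boot() });

    // Meanwhile the test body keeps running, and releases the closure
    // whenever it likes.
    tx.send(()).unwrap();
    task.await.unwrap();
}
```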

@bnaecker (Collaborator, Author) commented Aug 6, 2024

@sunshowers Yeah, block_in_place() makes sense, if we bite the bullet and use a multi-threaded runtime. (spawn_blocking() won't work, since it returns a join handle which we can't await in the boot closure.) I'll take a swing at that soon, thanks for the discussion!

bnaecker added a commit that referenced this issue Aug 6, 2024
- Fixes #6197
- Use a multi-threaded runtime for the test so we can literally sleep in
  the boot closure without blocking the test itself.
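
Roughly, the shape that fix takes, reusing the hypothetical MockZones and create_instance() stand-ins from the sketch above (see #6238 for the real change): with a multi-threaded runtime there is no paused clock to step, so the test uses short real durations instead, letting the closure genuinely sleep while the timeout fires on another thread.

```rust
use std::time::Duration;

// Variant of the earlier create_instance() stand-in: run the boot call
// in its own task, as the real instance-create path does, so that the
// sleep lands on a worker thread rather than on the thread that is
// awaiting the timeout.
async fn create_instance() -> Result<(), String> {
    tokio::spawn(async { MockZones::boot("oxz_instance") })
        .await
        .map_err(|e| e.to_string())?
}

#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn instance_create_times_out_while_booting() {
    // Short real durations replace paused-time jumps: "boot" takes two
    // real seconds, while the instance gives up after half of one.
    let ctx = MockZones::boot_context();
    ctx.expect().returning(|_| {
        std::thread::sleep(Duration::from_secs(2));
        Ok(())
    });

    let attempt =
        tokio::time::timeout(Duration::from_millis(500), create_instance());
    assert!(attempt.await.is_err());
}
```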
@lifning (Contributor) commented Aug 6, 2024

(For context, I had posted some findings from digging into an issue with this test last week at #6169 (comment). Is my conclusion there about channels panicking when runtimes tear down actually founded?)

@lifning seeing as this was your test originally, how strongly do you feel it should be preserved?

The intent of adding it was to make sure we had something exercising our system's behavior in the timeout case, back when propolis zone creation was sometimes observed taking over a minute (and causing instance-create requests to time out).
If things change, or have changed, such that we have a better way of testing what happens if zone creation times out -- or if instance management is overhauled in such a way that we can prove it structurally cannot be an issue any more -- then maybe it would no longer be necessary. But I don't have current knowledge of whether we're at that point.

@bnaecker (Collaborator, Author) commented Aug 6, 2024

Is my conclusion there about channels panicking when runtimes tear down actually founded?

I see this "channel closed" error quite frequently, and agree with you that it seems due to the unordered shutdown of Tokio's tasks.

The intent of adding it was to make sure we had something exercising our system's behavior in the timeout case, back when propolis zone creation was sometimes observed taking over a minute

Sounds reasonable. Given that we found a way that (I think) works around the sync/async boundary issues in the test, it's worth keeping the test around at this point. We can simulate slow boot in a reasonable amount of time, which assuages my main concern that the test would otherwise just require waiting for many tens of seconds.
