
test_instance_create_timeout_while_creating_zone is flaky #6197

Closed
bnaecker opened this issue Jul 31, 2024 · 12 comments · Fixed by #6238
Assignees: bnaecker
Labels: Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

@bnaecker (Collaborator)

The test test_instance_create_timeout_while_creating_zone attempts to verify what happens when zone boot times out while creating an instance. Prior to b74d691, the test was inadvertently relying on a DNS resolution request that would never succeed in order to "time out" the zone boot. In that commit, we removed that DNS resolution, revealing the problem in the test.

However, the test remains flaky. It relies on Tokio's time facilities to pause and advance time far enough that we time out the request to create an instance, but not so far that the zone actually boots. That kind of control of time is always finicky, since time often needs to be advanced in small jumps to allow the runtime to schedule and drive other tasks in the meantime. In this case, we can apparently just never call the MockZones::boot() method, presumably because the test succeeds and that task is torn down before it's called.

We should probably make the test work better by advancing time in small jumps, well beyond the expected timeout for the zone boot call itself.

Originally posted by @hawkw in #5749 (comment)

bnaecker self-assigned this Jul 31, 2024
@bnaecker (Collaborator, Author) commented Aug 2, 2024

I've been looking at this again, because I'm touching these tests for something unrelated. I think it's not possible to do what we want in this test.

The intent behind the test is this:

  1. Make a mock expectation with a synchronous closure that waits for a long time; call that duration boot_timeout.
  2. Pause Tokio time.
  3. Try to set an instance running.
  4. Apply a timeout to that future that is shorter than boot_timeout; call it instance_timeout.
  5. Step time beyond instance_timeout, but not beyond boot_timeout.
  6. Observe that the timeout actually does fire.
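
Putting those six steps together, the intended structure would look roughly like the sketch below. Everything here is a hypothetical stand-in -- the Zones trait, MockZones, and create_instance() -- not the real sled-agent code, and as the rest of this thread explains, the structure cannot work reliably:

```rust
use std::time::Duration;
use mockall::automock;

// Hypothetical stand-in for the real zones interface. `boot` is an
// associated function, so mockall generates a `boot_context()` handle.
#[automock]
trait Zones {
    fn boot(zone: &str) -> Result<(), String>;
}

// Hypothetical stand-in for the code under test, which calls the
// (mocked) zone-boot method from async context.
async fn create_instance() -> Result<(), String> {
    MockZones::boot("oxz_instance")
}

// Requires tokio's `test-util` feature for pause()/advance().
#[tokio::test]
async fn instance_create_times_out_while_booting() {
    const BOOT_TIMEOUT: Duration = Duration::from_secs(60);
    const INSTANCE_TIMEOUT: Duration = Duration::from_secs(10);

    // (1) A mock expectation whose synchronous closure waits a long
    // time. This is the crux of the problem: thread::sleep waits for
    // *wall-clock* time and blocks the runtime's only thread.
    let ctx = MockZones::boot_context();
    ctx.expect().returning(|_| {
        std::thread::sleep(BOOT_TIMEOUT);
        Ok(())
    });

    // (2) Pause Tokio's clock so the test can step it manually.
    tokio::time::pause();

    // (3, 4) Race instance creation against the shorter timeout.
    let attempt = tokio::time::timeout(INSTANCE_TIMEOUT, create_instance());

    // (5) Step past INSTANCE_TIMEOUT, but not past BOOT_TIMEOUT.
    tokio::time::advance(INSTANCE_TIMEOUT + Duration::from_secs(1)).await;

    // (6) The attempt should report a timeout. In practice this races:
    // depending on polling order the mock closure may never be called,
    // or may block the thread for 60 real seconds -- exactly the flake
    // described above.
    assert!(attempt.await.is_err());
}
```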

The problem is that this bridges async and sync code. The mockall closure must be synchronous. There are mechanisms for waiting on events or timeouts in synchronous code, such as a std channel or plain std::thread::sleep(). However! We're also trying to step time from the test body itself, and drive the test to completion.

If we call the mock closure at all before the test completes, it will block until whatever waiting mechanism we use completes. The closure is synchronous, and without an await point there is no way to yield back to the Tokio runtime to drive the rest of the tasks to completion. So what we want is a way to delay completing that closure, but that's not possible without indefinitely stalling the runtime itself.
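
That starvation is easy to demonstrate in isolation. Here is a self-contained toy (not from the test itself): a synchronous sleep inside a task holds the current-thread runtime's only thread, and paused Tokio time does nothing to shorten the wait:

```rust
use std::time::{Duration, Instant};

// Current-thread runtime; pause() needs tokio's `test-util` feature.
#[tokio::test]
async fn sync_block_starves_the_runtime() {
    tokio::time::pause();
    let start = Instant::now();
    let task = tokio::spawn(async {
        // A synchronous wait, like the mock closure's: it holds the
        // runtime's only thread hostage for real wall-clock time, and
        // there is no await point at which another task could run.
        std::thread::sleep(Duration::from_secs(2));
    });
    task.await.unwrap();
    // Pausing Tokio's clock did not help; two real seconds elapsed.
    assert!(start.elapsed() >= Duration::from_secs(2));
}
```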

sunshowers added the Test Flake label Aug 3, 2024
@sunshowers (Contributor) commented Aug 5, 2024

This flake seems to be pretty common in practice. Would it be worth adding retries at the nextest level (if the fix may take a while to develop)?

@bnaecker (Collaborator, Author) commented Aug 5, 2024

Unless I'm missing something, I don't believe it's possible to write this test the way we want. I don't see how one can block the mock zone-boot method without also blocking the single-threaded Tokio runtime. That is, either the test just takes the full time we are pretending zone boot takes (60s), or we fundamentally restructure the test or remove it. I personally feel the latter is the most expedient option, but I'd love to learn how we can keep the test.

@sunshowers (Contributor) commented Aug 5, 2024

Hmm, could you do something like injecting a oneshot channel to wait on? (It also might be worth using a multithreaded Tokio runtime for this test -- we do it for a few others.)

@sunshowers (Contributor)

Hmm, a multithreaded runtime does mean we lose the ability to use Tokio's pause/resume time facilities.

I guess the only real way to do this would be to pass the source of time down as a dependency. If that's too much work, it's probably just worth removing the test.
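
For reference, a minimal sketch of what "passing the source of time down" could look like; Clock and TestClock are hypothetical names, not anything that exists in the tree. The code under test would read time through the trait, and the test would hold a TestClock and advance it explicitly:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// The code under test asks a Clock for the time instead of consulting
// Tokio's (global) clock directly.
pub trait Clock: Send + Sync {
    fn now(&self) -> Instant;
}

// A test implementation whose notion of "now" only moves when the test
// says so -- no global time manipulation required.
pub struct TestClock(Mutex<Instant>);

impl TestClock {
    pub fn new() -> Self {
        Self(Mutex::new(Instant::now()))
    }

    // Move the fake clock forward; only code reading time through the
    // Clock trait observes the jump.
    pub fn advance(&self, by: Duration) {
        *self.0.lock().unwrap() += by;
    }
}

impl Clock for TestClock {
    fn now(&self) -> Instant {
        *self.0.lock().unwrap()
    }
}
```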

@bnaecker (Collaborator, Author) commented Aug 5, 2024

I guess the only real way to do this would be to pass the source of time down as a dependency

I think you mean something like: passing a start time or deadline into the MockZones::boot() closure, and waiting until that's reached, right? If so, I tried that in a previous iteration, and I don't think it can work. That closure has no way to wait for anything without also blocking the Tokio runtime, since it's synchronous. So, if that closure is called at all, it will deadlock the test, since the "outer" code advancing the timer will never run.

If that closure does not wait, then there's really no way for us to make booting appear slow! And since that's the whole point of the test, I feel like that lowers the value of the test.

@sunshowers Am I right in my assessment of the closure? Since it's sync, and we're controlling time in a single-threaded runtime, there's no way for it to "appear slow" without blocking the test entirely.

@sunshowers (Contributor) commented Aug 6, 2024

I think you mean something like: passing a start time or deadline into the MockZones::boot() closure, and waiting until that's reached, right? If so, I tried that in a previous iteration, and I don't think it can work. That closure has no way to wait for anything without also blocking the Tokio runtime, since it's synchronous. So, if that closure is called at all, it will deadlock the test, since the "outer" code advancing the timer will never run.

That's right in the context of a single-threaded runtime -- but using a multithreaded runtime should (I think) be able to work around that.

Let's say you somehow arranged for the MockZones::boot method to have access to a tokio::sync::oneshot::Receiver (e.g. via a task-local variable, or by passing it in somehow -- not quite sure how mockall works).

Then, in MockZones::boot, you could wait on it with blocking_recv.

And in the test itself, you could spin up a task which sends a message via the channel. With a multithreaded runtime, neither should block on the other.

In essence, you're moving away from global time manipulation (which I agree is pretty fragile) to something more controlled. But there's naturally a divergence between the test and the real code, and I don't know whether the test is still useful in this world.

@bnaecker (Collaborator, Author) commented Aug 6, 2024

Let's say you somehow arranged for the MockZones::boot method to have access to a tokio::sync::oneshot::Receiver (e.g. via a task-local variable, or by passing it in somehow -- not quite sure how mockall works).

Then, in MockZones::boot, you could wait on it with blocking_recv.

I'm not sure this will work. blocking_recv() will panic if called from within an async context, right? In this case, that context is the runtime created by #[tokio::test]. That is actually how this test was originally written, and what led me to notice the test failing in the first place!

@lifning seeing as this was your test originally, how strongly do you feel it should be preserved? I don't see a straightforward way to pretend zone boot takes a long time, since we cannot safely block the single-threaded Tokio runtime and still advance time from outside the closure. We could use a multi-threaded runtime, and have the test take the full duration of whatever slow boot process we want to emulate, but we would not be able to advance time in that scenario.

@sunshowers (Contributor) commented Aug 6, 2024

I'm not sure this will work. blocking_recv() will panic if called from within an async context, right?

Ah, not if you call spawn_blocking or block_in_place, though! Which you can do in a multi-threaded runtime context.
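
A self-contained sketch of that pattern (fake_boot is a stand-in for the synchronous mock closure): block_in_place tells the runtime the current worker is about to block, so blocking_recv() is permitted there and the remaining workers keep driving other tasks:

```rust
use tokio::sync::oneshot;

// block_in_place requires the multi-threaded runtime flavor; it panics
// on a current-thread runtime.
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn blocking_inside_a_sync_callback() {
    let (tx, rx) = oneshot::channel::<()>();
    // mockall `returning` closures are FnMut, so the receiver is moved
    // in via an Option and consumed on the first call.
    let mut rx = Some(rx);

    // Stand-in for the synchronous mock closure.
    let mut fake_boot = move || {
        tokio::task::block_in_place(|| {
            // Legal here: block_in_place marks this worker as blocked,
            // so blocking_recv() does not panic, and other tasks keep
            // running on the remaining worker threads.
            rx.take().unwrap().blocking_recv().unwrap();
        });
    };

    // The code under test would call the closure from an async task.
    let task = tokio::spawn(async move { fake_boot() });

    // Meanwhile the test body keeps running, and releases the closure
    // whenever it likes.
    tx.send(()).unwrap();
    task.await.unwrap();
}
```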

@bnaecker (Collaborator, Author) commented Aug 6, 2024

@sunshowers Yeah, block_in_place() makes sense, if we bite the bullet and use a multi-threaded runtime. (spawn_blocking() won't work, since it returns a join handle which we can't await in the boot closure.) I'll take a swing at that soon, thanks for the discussion!

bnaecker added a commit that referenced this issue Aug 6, 2024
- Fixes #6197
- Use a multi-threaded runtime for the test so we can literally sleep in
  the boot closure without blocking the test itself.
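
Roughly, the shape that fix takes, reusing the hypothetical MockZones and create_instance() stand-ins from the sketch above (see #6238 for the real change): with a multi-threaded runtime there is no paused clock to step, so the test uses short real durations instead, letting the closure genuinely sleep while the timeout fires on another thread.

```rust
use std::time::Duration;

// Variant of the earlier create_instance() stand-in: run the boot call
// in its own task, as the real instance-create path does, so that the
// sleep lands on a worker thread rather than on the thread that is
// awaiting the timeout.
async fn create_instance() -> Result<(), String> {
    tokio::spawn(async { MockZones::boot("oxz_instance") })
        .await
        .map_err(|e| e.to_string())?
}

#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn instance_create_times_out_while_booting() {
    // Short real durations replace paused-time jumps: "boot" takes two
    // real seconds, while the instance gives up after half of one.
    let ctx = MockZones::boot_context();
    ctx.expect().returning(|_| {
        std::thread::sleep(Duration::from_secs(2));
        Ok(())
    });

    let attempt =
        tokio::time::timeout(Duration::from_millis(500), create_instance());
    assert!(attempt.await.is_err());
}
```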
@lifning (Contributor) commented Aug 6, 2024

(For context, I had posted some findings from digging into an issue with this test last week at #6169 (comment). Is my conclusion there about channels panicking when runtimes tear down actually founded?)

@lifning seeing as this was your test originally, how strongly do you feel it should be preserved?

The intent of adding it was to make sure we had something exercising our system's behavior in the timeout case, back when propolis zone creation was sometimes observed taking over a minute (and causing instance-create requests to time out).
If things change, or have changed, such that we have a better way of testing what happens if zone creation times out -- or if instance management is overhauled in such a way that we can prove it structurally cannot be an issue any more -- then maybe it would no longer be necessary. But I don't have current knowledge of whether we're at that point.

@bnaecker (Collaborator, Author) commented Aug 6, 2024

Is my conclusion there about channels panicking when runtimes tear down actually founded?

I see this "channel closed" error quite frequently, and agree with you that it seems due to the unordered shutdown of Tokio's tasks.

The intent of adding it was to make sure we had something exercising our system's behavior in the timeout case, back when propolis zone creation was sometimes observed taking over a minute

Sounds reasonable. Given that we found a way that (I think) works around the sync/async boundary issues in the test, it's worth keeping the test around at this point. We can simulate slow boot in a reasonable amount of time, which assuages my main concern that the test would otherwise just require waiting for many tens of seconds.
