Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only send flushes when Downstairs is idle; send Barrier otherwise #1505

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Oct 14, 2024

This PR removes automatic flushes, per RFD 518. Instead, the new Barrier operation is sent. If the system is idle for a particular amount of time, we send a final Flush to put everything into a known state.

When the Upstairs retires jobs after a barrier operation, the system as a whole becomes ineligible for replay. This state determines whether the new Downstairs reconnects through Offline (which does replay) or Faulted (which does live-repair instead).

Removing automatic flushes is a noticeable performance improvement:

Screenshot 2024-10-11 at 11 06 26 AM

(tested on the London mini-cluster, with Upstairs and 3x Downstairs on different sleds)

region info:
  block size:      4096 bytes
  blocks / extent: 16384
  extent size:     64 MiB
  extent count:    2048
  total blocks:    33554432
  total size:      128 GiB
  encryption:      yes

@mkeeter mkeeter requested review from jmpesp and leftwo October 14, 2024 17:54
@mkeeter mkeeter force-pushed the mkeeter/no-auto-flush branch 2 times, most recently from b0c092b to 3a7e2f8 Compare October 15, 2024 20:40
@mkeeter mkeeter changed the base branch from main to mkeeter/io-state-job-and-byte-count October 15, 2024 20:40
@mkeeter
Copy link
Contributor Author

mkeeter commented Oct 15, 2024

Rebased to stage on top of #1507, because we want to only send Barrier operations if there's enough bytes / jobs buffered (which requires keeping track!)

@mkeeter mkeeter force-pushed the mkeeter/io-state-job-and-byte-count branch from 7b7ec22 to 1e4cc53 Compare October 16, 2024 13:48
Base automatically changed from mkeeter/io-state-job-and-byte-count to main October 16, 2024 14:20
@mkeeter mkeeter force-pushed the mkeeter/no-auto-flush branch 2 times, most recently from 796a397 to 2d9dba3 Compare October 17, 2024 21:32
upstairs/src/downstairs.rs Show resolved Hide resolved
panic!("expected Barrier, got message {m:?}");
}
harness.ds2.ack_barrier().await;
harness.ds3.ack_barrier().await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any way to check that replay is no longer available to this downstairs? I'm not sure if the test has access at this layer to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end of this unit test is checking that live-repair works, so I think we're implicitly testing that we don't do replay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was thinking more that the sending of a barrier operation has also flipped the can_replay in the Downstairs struct. I'm not sure if we can do that though here (or in a test elsewhere), as can_replay might not be exposed like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not easy to check from the outside (when we only have the Guest handle). If you really want it, we could probably add it to the downstairs_state helper function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth adding it to the downstairs_state helper function.
I don't see anywhere else that seems like a good place either. Nothing in the upstairs tests cover this case.

@leftwo
Copy link
Contributor

leftwo commented Oct 29, 2024

With this, we can probably close #1358 as fixed

@mkeeter mkeeter force-pushed the mkeeter/no-auto-flush branch 2 times, most recently from 5cc9448 to be0f66d Compare October 30, 2024 19:50
Copy link
Contributor

@faithanalog faithanalog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably tighten up IO_CACHED_MAX_BYTES soon. Right now we can theoretically buffer up to 1GiB of jobs for replay, and that feels like quite a lot to me. We usually won't- we will only get to that point if the guest is doing large amounts of IO, continuously so we never go idle, and the guest never sends a flush. That's unlikely/rare during normal filesystem operations (but will be easier to hit when writing to the raw block device like iodriver does). Because of that, we shouldn't expect more than a few VMs to hit this under normal operation. But in interest in not overcommitting resources, I think that bound should be lower.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants