what's involved with removing a sled? #4787

Closed
davepacheco opened this issue Jan 9, 2024 · 6 comments

@davepacheco
Collaborator

It'd be useful to have a written summary of exactly what has to happen as part of sled removal. This is analogous to #4651 but will include more things because we have to clean up anything that's been associated with the sled in its lifetime.

Off the top of my head, I assume we need to update/remove entries from: sled, physical_disk, zpool. Maybe dataset too? There's also stuff related to instances and Crucible regions. What exactly has to happen there?

Next step for me would probably be to search the schema for foreign keys pointing at any of these things (e.g., sled_id) and repeat recursively until we find nothing new.
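For concreteness, here is a rough sketch of what that recursive search could look like, assuming the schema lives in a single SQL file (the `schema/crdb/dbinit.sql` path below is an assumption) and that referencing columns follow a `<table>_id` naming convention. This heuristic would miss columns like `dataset.pool_id` mentioned later in this issue, so it's a starting point rather than an authoritative foreign-key walk.

```rust
// Rough sketch: find tables whose definitions mention a column like
// `sled_id`, then repeat for those tables until nothing new shows up.
// The schema path and the parsing heuristics here are illustrative only.
use std::collections::BTreeSet;
use std::fs;

fn main() -> std::io::Result<()> {
    let schema = fs::read_to_string("schema/crdb/dbinit.sql")?;

    // Split the schema into (table name, body) pairs, very naively.
    let tables: Vec<(String, String)> = schema
        .split("CREATE TABLE")
        .skip(1)
        .filter_map(|block| {
            // Skip "IF NOT EXISTS", strip any schema prefix and trailing "(".
            let raw = block
                .split_whitespace()
                .find(|tok| !matches!(*tok, "IF" | "NOT" | "EXISTS"))?;
            let name = raw.trim_end_matches('(').rsplit('.').next()?.to_string();
            Some((name, block.to_string()))
        })
        .collect();

    // Start from "sled" and keep going until no new tables show up.
    let mut found: BTreeSet<String> = BTreeSet::from(["sled".to_string()]);
    let mut frontier = vec!["sled".to_string()];

    while let Some(parent) = frontier.pop() {
        let column = format!("{parent}_id");
        for (name, body) in &tables {
            if body.contains(&column) && found.insert(name.clone()) {
                println!("{name} references {column}");
                frontier.push(name.clone());
            }
        }
    }
    Ok(())
}
```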

@smklein
Collaborator

smklein commented Jan 9, 2024

See also #4719; there is a ton of overlap between the two (removal of a sled seems like it implicitly "deactivates" all disks attached to that sled).

@askfongjojo

There are some related tickets for reference:

Besides the above,

  1. There are also log files, zone bundles, and other archived objects stored locally on the sled that we may want to migrate to somewhere else for retention.
  2. Propolis zones that didn't manage to be torn down (because the sled-agent already stopped responding or the sled is totally dead) might still have resource usage recorded in the database. It is possible that this is already taken care of when instances are moved to the failed state or when they are deleted by the user but we probably want to double-check that.

@askfongjojo

There was also a "proof of concept" when @augustuswm had to remove sled 10 from rack3. The POC obviously didn't include disk/log data migration but covered all the database things, as noted in https://github.com/oxidecomputer/colo/issues/46.

@smklein
Collaborator

smklein commented Jan 10, 2024

#612 is also related to this

@davepacheco
Collaborator Author

2. Propolis zones that didn't manage to be torn down (because the sled-agent already stopped responding or the sled is totally dead) might still have resource usage recorded in the database. It is possible that this is already taken care of when instances are moved to the failed state or when they are deleted by the user but we probably want to double-check that.

For reference, I think this is now covered by #4872.

@davepacheco
Collaborator Author

davepacheco commented Apr 17, 2024

Here are some notes from a bit of digging I just did. It's not exactly comprehensive but I wanted to look and see if there were obvious pieces we may have missed.

Broadly, we can divide sled state (or the cleanup actions for that state) into three categories:

  1. Sled-related state that needs to be cleaned up in Omicron, regardless of what else may have been running on this sled
  2. Customer Instances and Crucible volumes that were on the removed sled and now need to be dealt with
  3. Other component-specific actions that are required depending on what control plane services were running on this sled (e.g., for CockroachDB nodes, we'd need to tell CockroachDB that the node is gone forever and presumably provision a new one).

In this issue, I'm mostly concerned with category 1. RFD 459 (which is still a work in progress) discusses categories 2 and 3. To summarize category 2: instances on an expunged sled need to enter a failure path similar to what would happen if the sled rebooted. This depends on #4872. Crucible regions on an expunged sled need to be treated as gone forever, with Omicron and Crucible machinery getting kicked off to restore the expected number of copies for any affected volumes. Category 3 is complicated -- see the RFD for more.

Database state

Obviously there's one record in the sled table for each sled. Per RFD 457, this record will be taken care of by setting the sled policy to "expunged" and the sled state to "decommissioned".
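For concreteness, here is a rough Rust sketch of the policy/state split that RFD 457 describes. These enums are simplified stand-ins, not the actual Omicron types, which carry more detail; they're only meant to show the intent vs. observed-state distinction.

```rust
// Simplified stand-ins for the sled policy/state split described in RFD 457;
// not the real Omicron types.

/// Operator intent for the sled (what we *want*).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledPolicy {
    /// Sled is in service and may host new resources.
    InService,
    /// Operator has declared the sled permanently gone.
    Expunged,
}

/// Lifecycle state (what the system has actually *done*).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledState {
    /// Sled is still an active part of the control plane.
    Active,
    /// Cleanup is complete; the record is retained only for history.
    Decommissioned,
}

/// Expungement sets the policy first; the state flips to Decommissioned only
/// after the rest of the cleanup described in this issue has happened.
fn may_decommission(policy: SledPolicy, state: SledState) -> bool {
    policy == SledPolicy::Expunged && state == SledState::Active
}
```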

sled_id (or something like it) is referenced by:

  • sled_resource table. I expect we should probably clean these up when expunging a sled, but also that it probably doesn't matter much, since I think these exist only to avoid overcommitting resources on a sled and we won't be provisioning new resources onto expunged sleds. I'd hope these records would also be cleaned up when we clean up instances once we have #4872 ("need a way to trigger cleanup and next steps for vanished instances"). See the sketch after this list.
  • sled_underlay_subnet_allocation table. I expect we should be cleaning these up and are not doing so today. That would become a problem if someone attempted roughly 222 sled replacements within a single rack, because this table stores unique underlay octets, which range between 33 and 255.
  • physical_disk table. These records are taken care of similarly to the sled record itself: the policy will become "expunged". This was implemented in #5369 ("Expunge disks when sleds are expunged").
  • various inventory tables. I think we can ignore these. Only the most recent inventory is ever used, and new inventory collections should be made after a sled is expunged. It's correct for older ones to continue to reference sleds that are no longer present.
  • various blueprint tables. I think we can assume that dealing with these is already covered by other work. (It's correct for older blueprints to continue referring to expunged sleds. The expungement process will wind up with a target blueprint that no longer references an expunged sled, except maybe in ways that are aware that the sled is expunged.)
  • vmm table. I expect these should get cleaned up, but probably won't be until we have (and use) #4872 ("need a way to trigger cleanup and next steps for vanished instances").
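As referenced in the sled_resource bullet above, here is a purely illustrative sketch of the kind of cleanup implied by the first two bullets. In practice this would go through Nexus's datastore layer inside a transaction and be gated on the sled being expunged, not hand-written SQL; apart from sled_id and the table names taken from the list above, the details are assumptions.

```rust
// Purely illustrative: the cleanup implied by the sled_resource and
// sled_underlay_subnet_allocation bullets above. Real cleanup would go
// through Nexus's datastore layer, gated on the sled policy being "expunged".
use uuid::Uuid;

fn sled_cleanup_statements(sled_id: Uuid) -> Vec<String> {
    vec![
        // Release any resource reservations recorded against the sled.
        format!("DELETE FROM sled_resource WHERE sled_id = '{sled_id}'"),
        // Free the sled's underlay octet so a later add-sled operation can
        // reuse it (the octets only range from 33 to 255).
        format!("DELETE FROM sled_underlay_subnet_allocation WHERE sled_id = '{sled_id}'"),
    ]
}

fn main() {
    for stmt in sled_cleanup_statements(Uuid::nil()) {
        println!("{stmt}");
    }
}
```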

Besides those direct consumers, physical disks are referenced by the zpool table. zpool is referenced by dataset.pool_id and some inventory tables. Again I think we can ignore inventory tables here. dataset is referenced by region.dataset_id and region_snapshot.dataset_id. I'm assuming that physical_disk, zpool, dataset, and region are well understood and covered by the open tickets around #4719 and Crucible region replacement.

I've filed new tickets:

Other persistent state

Internal and external DNS generally need to be updated when a sled is expunged. This work has already been done via #4989 and #5212.

Metric data in Clickhouse presumably should remain unchanged, since it still reflects useful historical information about that sled.

Switches have state about sleds (e.g., routes). These are configured via Dendrite. Existing background tasks in Nexus take care of periodically updating switches to contain the expected configuration, and these should cover sled expungement.

Runtime state

Generally, Nexus doesn't keep in-memory state outside the main set of database tables. Sagas are an important exception: they may have in-memory state and even persistent state (outputs from saga nodes) that may be invalidated when a sled (or a component running on a sled) becomes expunged.

These are reflected in a few issues:

For sagas that are trying forever to reach a component that's now gone: these should probably check the current target blueprint and stop trying to contact things that are known to be gone. They need to make a saga-specific decision about what that means. For regular actions, they could treat this as a failure (triggering an unwind); for unwind actions, we may want to design flows so that sagas don't need to do anything in this case. (For example, if the unwind action is cleaning something up that was on some instance that has since been expunged, the expungement of that instance probably ought to be responsible for cleaning that thing up, rather than the saga.)
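To make the "check the current target blueprint" idea concrete, here is a hedged sketch of the decision an action might make. The Blueprint struct, the outcome enum, and the function shape are hypothetical stand-ins, not Omicron's real saga or blueprint APIs.

```rust
// Hypothetical sketch of a saga action deciding whether to keep retrying a
// call to a component on some sled. These types are stand-ins, not Omicron's
// actual saga or blueprint machinery.
use std::collections::BTreeSet;
use uuid::Uuid;

/// Stand-in for the subset of the target blueprint an action would consult.
struct Blueprint {
    expunged_sleds: BTreeSet<Uuid>,
}

enum ActionOutcome {
    /// The component answered; proceed with the saga.
    Done,
    /// Transient failure; the saga framework should retry the action.
    Retry,
    /// The sled is known to be gone forever: fail the action and unwind.
    FailAndUnwind,
}

fn contact_component(
    target_blueprint: &Blueprint,
    sled_id: Uuid,
    last_attempt_succeeded: bool,
) -> ActionOutcome {
    // Stop trying to reach things the blueprint says are gone forever.
    if target_blueprint.expunged_sleds.contains(&sled_id) {
        return ActionOutcome::FailAndUnwind;
    }
    if last_attempt_succeeded {
        ActionOutcome::Done
    } else {
        // The sled may just be temporarily unreachable; keep retrying.
        ActionOutcome::Retry
    }
}
```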

#4259 is a little different in that a reliable persistent workflow (RPW) approach probably makes more sense.
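For contrast with the saga approach, here is a minimal sketch of the RPW shape: periodically re-read the intended state and converge the system toward it, so the work survives restarts and doesn't hinge on any one in-flight saga. Everything here is generic illustration, not Omicron's actual background-task API.

```rust
// Minimal illustration of the reliable persistent workflow (RPW) shape:
// repeatedly reconcile actual state toward intended state. Generic
// illustration only; not Omicron's background-task machinery.
use std::{thread, time::Duration};

trait Reconciler {
    /// Read intended state (e.g., the target blueprint) and actual state,
    /// then make some progress toward closing the gap. Must be idempotent.
    fn reconcile(&mut self);
}

fn run_rpw<R: Reconciler>(mut r: R, period: Duration) {
    loop {
        r.reconcile();
        // In practice this loop would also be woken explicitly when its
        // inputs change, rather than only polling on a timer.
        thread::sleep(period);
    }
}
```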


That's about all I plan to do here for now. We may of course find new things during testing.
