One bad CSI volume can stop all volumes from being scheduled #3120

s4ke · 2023-02-14T13:54:46Z

While implementing CSI support for Hetzner Cloud, we ran into some strange behaviour: one "broken" volume can prevent all other volumes from being scheduled properly.

In the case of the Hetzner plugin, this was caused by a volume created in hel1. Volumes created in hel1 are generally not allowed to be scheduled in nbg1, which is expected. However, a publish of this volume was also attempted on a node in nbg1 (How would we prevent this? Do/should cluster volumes have support for placement constraints?), and that can never work on these nodes. Another volume created in nbg1, which should have succeeded, was then blocked by the earlier failure.

Only after force removing all volumes and services and starting from scratch (and this time making sure to only schedule stuff in nbg1), we were able to properly create a volume.

See hetznercloud/csi-driver#376 (comment) for further details.

s4ke · 2023-02-14T13:59:50Z

@neersighted should we create another issue for cluster volumes not having support for placement constraints?

s4ke · 2023-05-16T06:40:05Z

@dperny Can you take a look here? The linked comment on the hetznercloud CSI driver explains the issue quite well. I would love to help in any way I can.

dperny · 2023-06-20T16:03:52Z

I'm taking a look at this issue now.

dperny · 2023-06-21T16:10:44Z

@s4ke I believe there are several issues afoot here:

There is some error causing volumes to attempt to be scheduled to invalid nodes.
There is some error resulting in other errors locking up the volume management component.
There is some other error that may result in nodes incorrectly reporting volume status to the managers.

It's all quite nasty.

s4ke · 2023-06-21T16:46:51Z

@dperny okay. Thanks for taking a look. Were you able to reproduce this issue? Do you think that this is something on the driver level or in swarmkit?

dperny · 2023-06-21T17:13:10Z

So, the open questions I have right now about the linked issue:

The Volume is getting scheduled to a node which is outside of its availability constraint. This is odd. However, the CSIInfo field shows only PluginName, NodeID, and MaxVolumesPerNode. It does not seem to show AccessibleTopology. CSI volumes are scheduled based on AccessibleTopology as reported by the Node (which it gets from the plugin), and not by Labels on the Node (which are used for regular placement constraints).

But that said, even a blank AccessibleTopology should not result in a decision to schedule the Volume to the Node. I've checked the function myself, even written a test. It should not be scheduled. So the question remains, why?
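To make the topology rule above concrete, here is a minimal sketch of a topology match. It is not swarmkit's actual scheduler code: the `Topology` type, the `nodeCanAccess` and `matchesAll` functions, and the `location` segment key are all hypothetical stand-ins. It also assumes, as the comment above expects, that a node reporting no segments does not match a volume that carries topology requirements.

```go
package main

import "fmt"

// Topology is a simplified, hypothetical stand-in for a CSI topology:
// a set of key/value segments such as {"location": "nbg1"}.
type Topology map[string]string

// nodeCanAccess reports whether a node with the given segments lies within
// at least one of the volume's accessible topologies.
func nodeCanAccess(node Topology, accessible []Topology) bool {
	// Assumption: a volume without topology requirements is reachable
	// from any node.
	if len(accessible) == 0 {
		return true
	}
	for _, want := range accessible {
		if matchesAll(node, want) {
			return true
		}
	}
	return false
}

// matchesAll returns true only if every segment required by want is
// present on the node with the same value.
func matchesAll(node, want Topology) bool {
	for k, v := range want {
		if got, ok := node[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	// A volume created in hel1, as in the report.
	volume := []Topology{{"location": "hel1"}}

	fmt.Println(nodeCanAccess(Topology{"location": "hel1"}, volume)) // true
	fmt.Println(nodeCanAccess(Topology{"location": "nbg1"}, volume)) // false
	fmt.Println(nodeCanAccess(Topology{}, volume))                   // false: blank node topology
}
```

Under these assumptions, neither the nbg1 node nor a node with a blank topology would be eligible for the hel1 volume, which is why the observed scheduling decision is surprising.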

Further, I know of another issue, which I believe is related, in which a Volume object loses the PublishStatus for all nodes and ends up back in Created status. So I know there is an issue, somewhere, with the Volume PublishStatus being incorrectly set. The question for that is also: why?

Next, why is the Volume ID not being included in the ControllerPublishVolume request, as the logs seem to indicate? That shouldn't be possible. The Volume is not considered Created until it has its ID.

There's a rat's nest of issues that I suspect are all related to a small set of root causes. Whoever wrote this CSI code is a doofus.

s4ke · 2023-06-21T17:56:48Z

I will see what I can do to help answer your questions and when.

s4ke · 2023-06-21T17:58:07Z

Could this be related? moby/moby#45547

dperny · 2023-06-22T15:28:03Z

That's exactly the issue I had in mind.

dperny · 2023-06-22T16:29:47Z

OK, for starters, I have figured out one problem.

This is where we convert the gRPC response into Docker API objects:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/daemon/cluster/convert/node.go#L59-L70

And this is the Docker API object in question:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/api/types/swarm/node.go#L72-L85

It seems I am forgetting something critical in the conversion. AccessibleTopology is being ignored in the conversion, which makes debugging this issue difficult. The Scheduler takes into account the AccessibleTopology of the Node as reported by the CSI plugin to make its scheduling decisions.

Without knowing what AccessibleTopology is being reported, I cannot know if the Scheduler is making an error, or is suffering from garbage-in-garbage-out.
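As an illustration of the kind of conversion described above, here is a minimal sketch that carries the topology across. The types below are hypothetical stand-ins and not the real swarmkit or moby definitions. Only the field names PluginName, NodeID, MaxVolumesPerNode, and AccessibleTopology come from this thread; the `Segments` map and the `convertCSIInfo` function are assumptions made for the example.

```go
package main

import "fmt"

// grpcTopology and grpcCSIInfo are hypothetical stand-ins for the
// gRPC-side node description.
type grpcTopology struct {
	Segments map[string]string
}

type grpcCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *grpcTopology
}

// apiTopology and apiCSIInfo are hypothetical stand-ins for the
// API-side objects returned to clients.
type apiTopology struct {
	Segments map[string]string
}

type apiCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *apiTopology
}

// convertCSIInfo copies every field, including AccessibleTopology, so the
// topology reported by the plugin stays visible when inspecting a node.
func convertCSIInfo(in grpcCSIInfo) apiCSIInfo {
	out := apiCSIInfo{
		PluginName:        in.PluginName,
		NodeID:            in.NodeID,
		MaxVolumesPerNode: in.MaxVolumesPerNode,
	}
	if in.AccessibleTopology != nil {
		// Copy the map so the API object does not alias the source.
		segs := make(map[string]string, len(in.AccessibleTopology.Segments))
		for k, v := range in.AccessibleTopology.Segments {
			segs[k] = v
		}
		out.AccessibleTopology = &apiTopology{Segments: segs}
	}
	return out
}

func main() {
	in := grpcCSIInfo{
		PluginName:         "example-plugin", // hypothetical plugin name
		NodeID:             "node-1",
		MaxVolumesPerNode:  16,
		AccessibleTopology: &grpcTopology{Segments: map[string]string{"location": "nbg1"}},
	}
	out := convertCSIInfo(in)
	fmt.Println(out.PluginName, out.AccessibleTopology.Segments["location"])
}
```

With the topology present in the converted object, the value the scheduler sees could be compared against the volume's requirements during debugging.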

s4ke · 2023-09-26T14:39:07Z

It's been a while since I was in this discussion. I honestly lost track of where we are with this. How can I help with this? Are the questions still open?

s4ke changed the title ~~One bad CSI volume can stop the all volumes from being scheduled~~ One bad CSI volume can stop all volumes from being scheduled May 20, 2023

Gradlon mentioned this issue Jul 13, 2024

Docker Swarm compatibility seaweedfs/seaweedfs-csi-driver#98

Closed
