Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
baremetal: improve debuggability of ipi deployments #328
baremetal: improve debuggability of ipi deployments #328
Changes from 5 commits
a3bf88f
45ad3cd
0b73120
8b05817
1440b6f
545d738
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openshift/installer#2569
^^ already looking at making these problems more easy to report in the long term.
For now the installer now has list of common failures and how to identity them in https://github.com/openshift/installer/blob/master/docs/user/troubleshootingbootstrap.md#common-failures
the goal is to curate a list of detectable failures and then automatically do it as part of analysis.
the initial approach in 2569 was that, you show users most failure logs from the bundle and let them decide for themselves, but personally i would like us to come up with common known failure list and then just show this was the error, and here's how you might resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to make sure this behavior is similar to behavior for other platforms. Should the MAO or CBO be marked degraded when 1 worker fails to deploy? What if that is an issue with the worker itself? How can we distinguish between errors in the resource being provisioned verses an issue in the control plane?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all very good questions! It would be helpful to understand from someone in MAO (maybe @enxebre) how it works for other platforms, but my understanding is MAO doesn't go degraded from this case.
I think that MAO should show degraded if replicas are not met, or at the very least, if replicas are < 2, since we know we need 2 to get a working cluster on day 1 (unless controlplane is scheduable).
Perhaps cluster-operator-baremetal should also go degraded if provisioning fails, with more specific error messages bubbled up from baremetal-operator/ironic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a discussion with the MAO team this is what I learnt. The MAO would go into a degraded state only when the pods that it is responsible for deploying, fail to come up. When a resource it manages, in this case a worker Machine, does not come up, the MAO does not go into a failed/degraded state for other platforms and should probably be the same for baremetal too.
When the initial deployment with 2 workers fails, then it should be considered an Installer error and we should provide the best/detailed errors message we can provide by bubbling up what we can get from BMO and/or Ironic. The MAO team believes that this is not a reason to put the operator in a degraded state.
Since the baremetal platform is special, we could come up with a semantic on day 2 where if we notice a large number of worker failures (for example 90% of workers are not coming up), then the aggregated bad state could result in the operator being put in the degraded state. Currently, the operator does not have an aggregated view, so that needs to be added to the SLO at a future time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it doesn't go to a degraded state for anyone and I think that's a mistake and what I'm proposing to change here.
The main way the installer gets information about deployment success is largely through operator states via CVO, it would need a special case to count workers meeting the requested number of replicas.
If MAO can't provision machines, why is that not a degraded state of the operator?
CC: @abhinavdahiya
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stbenjam This docmentation https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#what-should-an-operator-report-with-clusteroperator-custom-resource helped me understand the semantics behind different Operator states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are countless transient scenarios for "Failure to deploy a worker". This makes impractical putting a reasonable generic semantic on top of it. And so this makes worthless to let the overall operator going degraded in such a heterogeneous scenario.
Instead I believe the boundaries to signal the details of theses errors belong to individual machine resource conditions and any lower level resource. Just like we do for any other provider https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40
Then to communicate "Failure to deploy a worker" We already trigger alerts any time a machine has no node regardless of the failure details and regardless the provider. So each failure details can then be analysed in the format described above.
Beyond all the above, regardless of the failure details and based on the overall health of the cluster (e.g 99 out 100 machines has no node) we might decide our criteria for a semantic that represents a permanent global issue and choose to let the mao going degraded in that case. But that's a separate scoped discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These don't show up in the installer output, and as far as I know they do try to capture alerts.
If you've ever done an install and ended up with a non-viable cluster due to insufficient workers, the UX is unacceptable. You get a report of a dozen failing operators and absolutely no indication it's because you don't have enough workers. @sdodson previously mentioned maybe we could do something in the installer about it (#328 (comment)), which may help the problem I guess, but doesn't feel like the right solution to me as machine-api-operator, being the top-level operator for dealing with machines, should be signaling clearly about the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd be then a very a specific issue: "Installer output not capturing some existing alerts as expected".
I agree. That scenario should be covered by my last statement in the previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's on the operator owners to make sure the errors are clear. Think about how it is not only installer that is the consumer of these message, but also admins during upgrades.
So personally the goal should be to ensure that each operator is responsible for using clear error messages in the status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it's just on day 1 the worker deployment failure seems to be special to me. It causes a lot of noise as a bunch of operators start reporting error messages that make it hard to point to a root cause unless you've seen the problem before. I don't think machine-api operator even reports anything useful when this happens, but if it did, it'd get lost in mix of the many other failing operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take ingress/console operator as an example - if the worker fails, they will be in error - but, from my experience of installing openshift for first few times - the user will have no idea that this is the reason. He will just see that those operators are down.
What is possible maybe to do is to have some kind of 'validators' - either from the installer binary or as an operator - that can analyze logs or cluster runtime state (with minimal requirement for cluster functionality - such as passwordless ssh between nodes) that can look into the state of the cluster and explain the user what went wrong. If we provide an infra for writing those validators, then operators owners / qe / intergration team will be able to enhance those once they rn into an issue that it was hard to analyze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they asked for a certain number of works and they didn't get them that seems reasonable to have a special error for that.
I also think some generic orientation regarding how to investigate operators failing may help as well. They'll need to learn that skill eventually no matter what. So documenting how to look at an Operator's status and referencing that seems to be something worth doing no matter what. Ingress for example often tells you that the dns entry doesn't exist but people don't even know where to look for that.