Fault Management (Analysis and Handling) #1520
Comments
HLD (md) PR: #1527
Dell registered as the reviewer.
Community review recording: https://zoom.us/rec/share/G4YPod_DoyMGGc8RG-A6jakAEOwR4INXe8pfG5IXrDKZS5ozbyghJyXASgEthkZq.8EmVrFbqX2t3qan3
@venkatmahalingam can you please let me know the GitHub IDs of the other reviewers from Dell? Thanks.
Basic Information (context)
Any failure (or an error) impacting a system/chassis or a sub-system is regarded as a fault.
Faults are broadly classified into SW (Software) and HW (Hardware) faults:
They may occur at any of the following stages of the system's operation:
Present State
In SONiC, a fault is represented as an Event or an Alarm.
SONiC has an Event Framework HLD, which lets an event detector publish its events to the eventD redisDB.
However, there is no Fault Manager/Handler that can take the needed or platform-specified action(s) to recover the system from the generated fault.
Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:
The action could be either generic or platform-specific.
Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
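
The recommend-or-override behavior described above could be sketched roughly as follows. This is only an illustration, not the design from the HLD: the `Fault` fields, the `FAULT_ACTION_POLICY` layout, the fault and action names, and `resolve_action` are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record; field names are assumptions for this sketch."""
    source: str              # component that detected the fault
    fault_id: str            # identifier of the fault
    recommended_action: str  # action suggested by the fault source/detector


# Hypothetical platform-supplied table: (source, fault_id) -> system-level action.
# None means "no override": accept the detector's recommended action.
FAULT_ACTION_POLICY: Dict[Tuple[str, str], Optional[str]] = {
    ("thermal", "TEMP_HIGH"): "shutdown",
    ("pmon", "FAN_FAIL"): None,
}


def resolve_action(
    fault: Fault,
    policy: Dict[Tuple[str, str], Optional[str]] = FAULT_ACTION_POLICY,
) -> str:
    """Return the system-level override if the policy has one,
    otherwise fall back to the detector's recommended action."""
    override = policy.get((fault.source, fault.fault_id))
    return override if override is not None else fault.recommended_action


if __name__ == "__main__":
    # Policy overrides the detector's recommendation for this fault.
    print(resolve_action(Fault("thermal", "TEMP_HIGH", "log")))
    # Policy entry exists but carries no override, so the recommendation wins.
    print(resolve_action(Fault("pmon", "FAN_FAIL", "restart_service")))
```

Faults that have no entry in the table also fall back to the recommended action in this sketch; a real policy might instead define an explicit platform default.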