Add a init/health check for mountpoints #1639
Labels
comp:agent
Related to Agent component
comp:storage-proxy
Related to Storage proxy component
platform:enterprise
Backend.AI Enterprise support.
urgency:4
As soon as feasible, implementation is essential.
Milestone
There are many cases in production setups that filesystem mounts fail or get disconnected at runtinme. Also some high-performance NFS backends take a very long time (e.g., 1-2 minutes) to initialize their mounts when the server is rebooted.
Although we are dealing with this kind of issues by adding a busy-wait loop in the runner scripts or mangling some systemd configurations, but it is sometimes difficult to ensure the customers that it's not our problems.
Detection
HOWTO
Let's use
mountpoint
to check/wait if the given set of directories are actually mounted or not upon startup.If they fail to mount for a long time (e.g., 10 minutes), let's make agents and storage-proxy to actively fail and show appropriate error logs.
Mitigation
schedulable
to off.schedulable
of the agent or restart the agent.Steps
status_history
toagents
to keep track of status changes and reasons #1843mountpoint
health check with external monitoring systems like Zabbix (configurable) #1844mountpoint
by agents (configurable) #1846mountpoint
readiness on startup with configurable retries and timeouts #1847Expected result
This will greatly reduce the field support efforts that are related to storage issues.
The text was updated successfully, but these errors were encountered: