Add a init/health check for mountpoints #1639

achimnol · 2023-10-19T11:46:11Z

There are many cases in production setups in which filesystem mounts fail or get disconnected at runtime. Also, some high-performance NFS backends take a very long time (e.g., 1-2 minutes) to initialize their mounts when the server is rebooted.

Although we deal with this kind of issue by adding a busy-wait loop in the runner scripts or tweaking systemd configurations, it is sometimes difficult to assure customers that the problem is not on our side.

Detection

HOWTO

Let's use the mountpoint command to check, and wait until, the given set of directories is actually mounted upon startup.
If they fail to mount for a long time (e.g., 10 minutes), let's make the agents and storage-proxy fail actively and show appropriate error logs.
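
A minimal sketch of the startup wait described above, not the actual agent code. The names is_mounted and wait_for_mounts, the 5-second polling interval, and the injectable check callable are assumptions for illustration; the 600-second default mirrors the "10 minutes" example. It relies on `mountpoint -q` exiting with status 0 only when the path is a mount point.

```python
import subprocess
import time
from typing import Callable, Iterable


def is_mounted(path: str) -> bool:
    """Return True if `mountpoint -q` reports that path is a mount point."""
    # `mountpoint -q` prints nothing and exits with 0 for a mount point.
    return subprocess.run(["mountpoint", "-q", path]).returncode == 0


def wait_for_mounts(
    paths: Iterable[str],
    timeout: float = 600.0,   # assumed default: the 10 minutes from the proposal
    interval: float = 5.0,    # assumed polling interval
    check: Callable[[str], bool] = is_mounted,
) -> None:
    """Block until every path is mounted, or raise TimeoutError."""
    pending = set(paths)
    deadline = time.monotonic() + timeout
    while True:
        # Keep only the paths that are still not mounted.
        pending = {p for p in pending if not check(p)}
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"not mounted after {timeout:.0f}s: {sorted(pending)}"
            )
        time.sleep(interval)
```

On startup the agent or storage-proxy could call wait_for_mounts() with its configured directories and turn a TimeoutError into an error log followed by a non-zero exit.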

Mitigation

Either:
- Explicitly reject kernel creation requests with vfolder mounts on lost volumes with proper error messages.
- Turn the agent's schedulable to off.
Raise an alarm to the cluster administrator.
- ...so that they can escalate the issue to the infrastructure team/manager.
- After the infra-level issue is resolved, we can instruct the administrator to turn the agent's schedulable back on or restart the agent.

Steps

- Add status_history to agents to keep track of status changes and reasons #1843 (comp:agent, comp:manager)
- Add a manager endpoint to publish monitoring alarm events #1845 (comp:manager)
- Integrate mountpoint health check with external monitoring systems like Zabbix (configurable) #1844 (area:infrastructure, comp:agent)
- Perform periodic health checks of mountpoint by agents (configurable) #1846 (comp:agent)
- Check mountpoint readiness on startup with configurable retries and timeouts #1847 (comp:agent)
- Webhook support for monitoring alarms #104 (comp:agent, comp:manager)
- Avoid assigning new sessions if free disk space of agent is below a threshold #3055 (area:infrastructure, comp:agent, effort:normal, impact:invisible, platform:enterprise, type:feature, urgency:3)
Expected result

This will greatly reduce the field support efforts that are related to storage issues.

achimnol · 2024-10-23T06:58:33Z

#104 would also be a part of this issue, sharing and extending the same proposed monitoring alarm subsystem. @HyeockJinKim

achimnol · 2024-10-23T07:01:05Z

Also a new consideration:

When a system call hangs inside the kernel, even the mountpoint command goes into the uninterruptible sleep (D) state. If this happens, we should not keep re-executing mountpoint indefinitely, to avoid accumulating D-state processes. We should apply a subprocess-level timeout to detect this and raise an alarm.
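
A rough sketch of such a subprocess-level timeout, under stated assumptions: MountProbe, its status strings, and the 5-second default are hypothetical names and values, not existing code. It uses Popen.wait(timeout=...) rather than subprocess.run(timeout=...), because run() waits for the child again after killing it, and that wait could itself block if the child cannot be killed while stuck in the kernel. A path whose previous probe is still alive is reported as hung without spawning another process.

```python
import subprocess
from typing import Dict, Sequence


class MountProbe:
    """Run one probe command per path; never stack probes on a hung path."""

    def __init__(
        self,
        timeout: float = 5.0,  # assumed default
        command: Sequence[str] = ("mountpoint", "-q"),
    ) -> None:
        self.timeout = timeout
        self.command = tuple(command)
        # Probes that timed out and may still be stuck, keyed by path.
        self._stuck: Dict[str, subprocess.Popen] = {}

    def check(self, path: str) -> str:
        """Return "mounted", "not-mounted", or "hung"."""
        prev = self._stuck.get(path)
        if prev is not None:
            if prev.poll() is None:
                # The earlier probe never exited: do not spawn another one.
                return "hung"
            del self._stuck[path]
        proc = subprocess.Popen(
            [*self.command, path],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            rc = proc.wait(timeout=self.timeout)
        except subprocess.TimeoutExpired:
            # Request termination but do not wait for it: a process blocked
            # inside the kernel may not exit until the I/O completes.
            proc.kill()
            self._stuck[path] = proc
            return "hung"
        return "mounted" if rc == 0 else "not-mounted"
```

A "hung" result is the point at which an alarm would be raised; the caller decides how often to retry.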

HyeockJinKim · 2024-10-28T02:32:34Z

I'm thinking of working with the following structure, and I'd like to ask for your thoughts on it. @achimnol

- Perform a full mountpoint check with a timeout (like 5 sec) at an interval (like 5 min).
  - The timeout and interval are configurable.
  - The periodic behavior is independent of the server.
  - Mount points are registered and unregistered in the inspector when mounting.
- Return a CheckResult value for success/failure after checking.
- Process the CheckResult in AlertSystem (currently only logging).
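
To make the proposed structure concrete, here is one possible sketch; it is not the code in the linked pull request. CheckResult and AlertSystem are the names from the proposal, while MountInspector, its methods, the fields of CheckResult, and the probe callable (assumed to return "mounted" or some other status string, and to enforce its own timeout) are hypothetical.

```python
import logging
import threading
from dataclasses import dataclass
from typing import Callable, List, Set

log = logging.getLogger("mount-inspector")  # hypothetical logger name


@dataclass(frozen=True)
class CheckResult:
    path: str
    ok: bool
    detail: str = ""


class AlertSystem:
    """Currently only logs failures, as in the proposal."""

    def handle(self, results: List[CheckResult]) -> List[CheckResult]:
        failed = [r for r in results if not r.ok]
        for r in failed:
            log.error("mountpoint check failed: %s (%s)", r.path, r.detail)
        return failed


class MountInspector:
    def __init__(
        self,
        probe: Callable[[str], str],  # assumed: returns "mounted" on success
        alerts: AlertSystem,
        interval: float = 300.0,      # the "like 5 min" example
    ) -> None:
        self._probe = probe
        self._alerts = alerts
        self._interval = interval
        self._paths: Set[str] = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def register(self, path: str) -> None:
        with self._lock:
            self._paths.add(path)

    def unregister(self, path: str) -> None:
        with self._lock:
            self._paths.discard(path)

    def check_all(self) -> List[CheckResult]:
        # Snapshot the paths so probing does not hold the lock.
        with self._lock:
            paths = sorted(self._paths)
        results = []
        for path in paths:
            status = self._probe(path)
            results.append(CheckResult(path, status == "mounted", status))
        return results

    def run_forever(self) -> None:
        # Meant to run in its own thread, independent of the server loop.
        while not self._stop.is_set():
            self._alerts.handle(self.check_all())
            self._stop.wait(self._interval)

    def stop(self) -> None:
        self._stop.set()
```

Starting it with threading.Thread(target=inspector.run_forever, daemon=True).start() would keep the periodic check separate from request handling; an asyncio task would be an equally valid choice.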

achimnol added the type:feature, platform:enterprise, comp:agent, comp:storage-proxy, and urgency:4 labels Oct 19, 2023

achimnol added this to the 23.09 milestone Oct 19, 2023

achimnol changed the title ~~Add a mountpoint check routine during startup and health check~~ Add a init/health check for mountpoints Oct 19, 2023

achimnol assigned kyujin-cho and fregataa Dec 8, 2023

achimnol assigned hoyajigi Jan 18, 2024

achimnol modified the milestones: 23.09, 24.03 Jan 18, 2024

achimnol mentioned this issue Apr 12, 2024

Explore Vector as a log and metric exporter #2018

Open

achimnol removed the type:feature label Oct 18, 2024

achimnol assigned HyeockJinKim and unassigned hoyajigi and fregataa Oct 23, 2024

HyeockJinKim linked a pull request Oct 29, 2024 that will close this issue

feat: Added function to check for mounted filesystems #2992

Open

7 tasks
