Add a init/health check for mountpoints #1639

Open
7 tasks
achimnol opened this issue Oct 19, 2023 · 3 comments · May be fixed by #2992
Labels
comp:agent Related to Agent component comp:storage-proxy Related to Storage proxy component platform:enterprise Backend.AI Enterprise support. urgency:4 As soon as feasible, implementation is essential.

@achimnol
Member

achimnol commented Oct 19, 2023

There are many cases in production setups where filesystem mounts fail or get disconnected at runtime. Also, some high-performance NFS backends take a very long time (e.g., 1-2 minutes) to initialize their mounts when the server is rebooted.

We currently deal with this kind of issue by adding a busy-wait loop in the runner scripts or by tweaking systemd configurations, but it is sometimes difficult to assure customers that the problem is not on our side.

Detection

HOWTO

Let's use mountpoint upon startup to check, and wait until, the given set of directories is actually mounted.
If they fail to mount for a long time (e.g., 10 minutes), let's make the agents and storage-proxy actively fail and show appropriate error logs.
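
A minimal sketch of the startup wait described above, to make the idea concrete. It assumes the util-linux `mountpoint -q` command is available; the names `is_mounted` and `wait_for_mounts`, the 5-second poll interval, and the use of `RuntimeError` are placeholders for illustration, not existing Backend.AI code.

```python
import subprocess
import time
from typing import Callable, Iterable


def is_mounted(path: str) -> bool:
    """Return True if `mountpoint -q` reports that `path` is a mountpoint."""
    # `mountpoint -q` prints nothing and exits with 0 for a mountpoint.
    return subprocess.run(["mountpoint", "-q", path]).returncode == 0


def wait_for_mounts(
    paths: Iterable[str],
    deadline_sec: float = 600.0,  # "a long time (e.g., 10 minutes)"
    poll_sec: float = 5.0,        # assumed poll interval
    check: Callable[[str], bool] = is_mounted,
) -> None:
    """Block until every path is mounted, or fail once the deadline passes."""
    pending = set(paths)
    give_up_at = time.monotonic() + deadline_sec
    while True:
        # Keep only the paths that are still not mounted.
        pending = {p for p in pending if not check(p)}
        if not pending:
            return
        if time.monotonic() >= give_up_at:
            # Fail actively so the service exits with a clear error log.
            raise RuntimeError(
                f"mountpoints not ready after {deadline_sec:.0f}s: {sorted(pending)}"
            )
        time.sleep(poll_sec)
```

The `check` parameter exists only so the loop can be exercised without real mounts. Note that this plain version does not protect against a probe that hangs inside the kernel.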

Mitigation

  • Either:
    • Explicitly reject kernel creation requests with vfolder mounts on lost volumes with proper error messages.
    • Set the agent's schedulable to off.
  • Raise an alarm to the cluster administrator.
    • ...so that they can escalate the issue to the infrastructure team/manager.
    • After the infra-level issue is resolved, we could instruct the administrator to turn the agent's schedulable back on or restart the agent.
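
As a rough illustration of the first mitigation option (rejecting requests whose vfolder mounts sit on lost volumes), something like the following could work. Everything here is hypothetical: `VolumeUnavailable`, `reject_if_on_lost_volume`, and the idea that lost volumes are tracked as a list of root paths are assumptions for the sketch, not the actual agent API.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


class VolumeUnavailable(Exception):
    """Hypothetical error for mounts that live on a lost volume."""


def reject_if_on_lost_volume(
    requested_mounts: Iterable[str],
    lost_volumes: Iterable[str],
) -> None:
    """Raise with a descriptive message if any requested mount is on a lost volume."""
    lost = [PurePosixPath(v) for v in lost_volumes]
    bad: List[str] = []
    for mount in requested_mounts:
        path = PurePosixPath(mount)
        # A mount is affected if it is a lost volume root or lies beneath one.
        if any(path == vol or vol in path.parents for vol in lost):
            bad.append(mount)
    if bad:
        raise VolumeUnavailable(
            "cannot create kernel: vfolder mounts on unavailable volumes: "
            + ", ".join(bad)
        )
```

Comparing path components rather than string prefixes avoids treating `/mnt/vol10` as part of `/mnt/vol1`.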

Steps

  1. comp:agent comp:manager
    fregataa
  2. comp:manager
    fregataa
  3. area:infrastructure comp:agent
    hoyajigi
  4. comp:agent
    kyujin-cho
  5. comp:agent
  6. comp:agent comp:manager
  7. area:infrastructure comp:agent effort:normal impact:invisible platform:enterprise type:feature urgency:3

Expected result

This will greatly reduce the field support efforts that are related to storage issues.

@achimnol achimnol added type:feature Add new features platform:enterprise Backend.AI Enterprise support. comp:agent Related to Agent component comp:storage-proxy Related to Storage proxy component urgency:4 As soon as feasible, implementation is essential. labels Oct 19, 2023
@achimnol achimnol added this to the 23.09 milestone Oct 19, 2023
@achimnol achimnol changed the title Add a mountpoint check routine during startup and health check Add a init/health check for mountpoints Oct 19, 2023
@achimnol achimnol modified the milestones: 23.09, 24.03 Jan 18, 2024
@achimnol achimnol removed the type:feature Add new features label Oct 18, 2024
@achimnol
Member Author

achimnol commented Oct 23, 2024

#104 would also be part of this issue, sharing and extending the same proposed monitoring alarm subsystem. @HyeockJinKim

@achimnol
Member Author

achimnol commented Oct 23, 2024

Also a new consideration:

  • When a system call hangs inside the kernel, even the mountpoint command goes into the uninterruptible sleep (D) state. If this happens, we should not keep re-executing mountpoint indefinitely, so that D-state processes do not accumulate. We should apply a subprocess-level timeout to detect this and raise an alarm.

@achimnol achimnol assigned HyeockJinKim and unassigned hoyajigi and fregataa Oct 23, 2024
@HyeockJinKim
Collaborator

HyeockJinKim commented Oct 28, 2024

I'm thinking of working with the following structure, and I'd like to ask for your thoughts on it. @achimnol

  1. perform a full mountpoint check with a timeout (e.g., 5 sec) at an interval (e.g., 5 min)
    • The timeout and interval are configurable.
    • The periodic behavior is independent of the server
    • Mount points are registered and unregistered in the inspector when mounting
  2. return a CheckResult value for success/failure after checking
  3. perform processing on CheckResult in AlertSystem (currently only logging)
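
The three steps above could be sketched roughly as follows. Only the names `CheckResult` and `AlertSystem` come from this comment; `MountInspector`, its method names, the `probe` callback signature, and the thread-based loop are assumptions made to keep the example self-contained.

```python
import logging
import threading
from dataclasses import dataclass
from typing import Callable, List, Set

log = logging.getLogger("mount-inspector")


@dataclass(frozen=True)
class CheckResult:
    path: str
    ok: bool
    detail: str = ""


class AlertSystem:
    """Consumes check results; currently only logs failures."""

    def handle(self, result: CheckResult) -> None:
        if not result.ok:
            log.error("mountpoint check failed: %s (%s)", result.path, result.detail)


class MountInspector:
    def __init__(
        self,
        probe: Callable[[str, float], bool],  # may raise TimeoutError
        alerts: AlertSystem,
        timeout_sec: float = 5.0,     # configurable
        interval_sec: float = 300.0,  # configurable
    ) -> None:
        self._probe = probe
        self._alerts = alerts
        self.timeout_sec = timeout_sec
        self.interval_sec = interval_sec
        self._paths: Set[str] = set()

    def register(self, path: str) -> None:
        self._paths.add(path)

    def unregister(self, path: str) -> None:
        self._paths.discard(path)

    def check_all(self) -> List[CheckResult]:
        """Run one full check over all registered mountpoints."""
        results: List[CheckResult] = []
        for path in sorted(self._paths):
            try:
                ok = self._probe(path, self.timeout_sec)
                detail = "" if ok else "not mounted"
            except TimeoutError:
                ok = False
                detail = f"probe timed out after {self.timeout_sec}s"
            result = CheckResult(path, ok, detail)
            self._alerts.handle(result)
            results.append(result)
        return results

    def run(self, stop: threading.Event) -> None:
        """Check periodically, independent of the server, until `stop` is set."""
        while not stop.is_set():
            self.check_all()
            stop.wait(self.interval_sec)
```

Keeping `AlertSystem.handle` as the single sink means later alarm channels can be added without touching the inspector.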

@HyeockJinKim HyeockJinKim linked a pull request Oct 29, 2024 that will close this issue