zfs and chown subprocesses calls are failing to internal error in ocluster-worker #248

Open
shonfeder opened this issue Sep 5, 2024 · 3 comments

Comments

@shonfeder
Contributor

Noticed in https://github.com/tarides/infrastructure/issues/375#issuecomment-2329726772

ocluster-worker: internal error, uncaught exception:
                 Failure("\"zfs\" \"create\" \"--\" \"obuilder/state\" failed with exit status 1")
                 
2024-09-04 20:58.56         worker [INFO] Prune threshold not set and docker max df size is not. Will not check for low disk-space!
cannot open 'obuilder/cache-tmp': dataset does not exist
cannot open 'obuilder/state': dataset does not exist
2024-09-04 20:58.56    application [INFO] Exec "zfs" "create" "--" "obuilder/state"
cannot create 'obuilder/state': no such pool 'obuilder'
ocluster-worker: internal error, uncaught exception:
                 Failure("\"zfs\" \"create\" \"--\" \"obuilder/state\" failed with exit status 1")
                 
2024-09-04 20:59.06         worker [INFO] Prune threshold not set and docker max df size is not. Will not check for low disk-space!
2024-09-04 20:59.06    application [INFO] Exec "zfs" "destroy" "-R" "-f" "--" "obuilder/cache-tmp"
2024-09-04 20:59.07    application [INFO] Exec "chown" "0:0" "/Volumes/obuilder/state"
chown: /Volumes/obuilder/state: No such file or directory
ocluster-worker: internal error, uncaught exception:
                 Failure("\"chown\" \"0:0\" \"/Volumes/obuilder/state\" failed with exit status 1")
ocluster-worker: internal error, uncaught exception:
                 Failure("\"chown\" \"0:0\" \"/Volumes/obuilder/state\" failed with exit status 1")

administrator@m1-worker-03 ~ % grep -A 2 'internal error' ./ocluster.log | tail -n 10
                 Failure("\"zfs\" \"create\" \"--\" \"obuilder/state\" failed with exit status 1")

I suspect this may be causing, contributing to, or masking problems that lead the macOS builders to stop being able to build jobs. But proper error handling for these subprocess calls should be put in place regardless.
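To illustrate the kind of error handling meant here, below is a minimal sketch in Python (not the worker's actual code, and not its language). It assumes a hypothetical helper, `run_checked`, that reports a failed subprocess as a value the caller can inspect, so a predictable failure such as a missing pool can be logged with context instead of surfacing as an uncaught exception. The names `ExecResult` and `run_checked` are invented for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ExecResult:
    """Outcome of one subprocess call (hypothetical type for this sketch)."""
    argv: Sequence[str]
    exit_status: Optional[int]  # None when the program could not be started
    stderr: str

    @property
    def ok(self) -> bool:
        return self.exit_status == 0


def run_checked(argv: Sequence[str]) -> ExecResult:
    """Run argv and return its outcome instead of raising on failure."""
    try:
        proc = subprocess.run(list(argv), capture_output=True, text=True)
    except OSError as exc:
        # The program could not be started at all (e.g. binary not found).
        return ExecResult(argv, None, str(exc))
    return ExecResult(argv, proc.returncode, proc.stderr.strip())
```

A caller could then do something like `r = run_checked(["zfs", "create", "--", "obuilder/state"])`, and when `r.ok` is false, log `r.stderr` as an expected "storage not ready" condition before exiting with a non-zero status.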

@shonfeder shonfeder self-assigned this Sep 5, 2024
@mtelvers
Member

mtelvers commented Sep 5, 2024

These commands are executed once during the initialisation of obuilder. They ensure that the necessary folders are available and have the correct permissions.

These errors indicate that the ZFS file system is unavailable, has failed, or is just not ready. Obuilder then exits, and launchctl restarts it after ~15 seconds when, hopefully, the situation has improved.

See this larger extract from the log. On the third restart, the ZFS modules were loaded, and the system started correctly.

2024-08-16 18:14.58         worker [INFO] Prune threshold not set and docker max df size is not. Will not check for low disk-space!
The ZFS modules are not loaded.
Try running '/sbin/kextload zfs.kext' as root to load them.
The ZFS modules are not loaded.
Try running '/sbin/kextload zfs.kext' as root to load them.
2024-08-16 18:14.58    application [INFO] Exec "zfs" "create" "--" "obuilder/state"
The ZFS modules are not loaded.
Try running '/sbin/kextload zfs.kext' as root to load them.
ocluster-worker: internal error, uncaught exception:
                 Failure("\"zfs\" \"create\" \"--\" \"obuilder/state\" failed with exit status 1")
                 
2024-08-16 18:15.08         worker [INFO] Prune threshold not set and docker max df size is not. Will not check for low disk-space!
cannot open 'obuilder/cache-tmp': dataset does not exist
cannot open 'obuilder/state': dataset does not exist
2024-08-16 18:15.08    application [INFO] Exec "zfs" "create" "--" "obuilder/state"
cannot create 'obuilder/state': no such pool 'obuilder'
ocluster-worker: internal error, uncaught exception:
                 Failure("\"zfs\" \"create\" \"--\" \"obuilder/state\" failed with exit status 1")
                 
2024-08-16 18:15.18         worker [INFO] Prune threshold not set and docker max df size is not. Will not check for low disk-space!
2024-08-16 18:15.18    application [INFO] Exec "zfs" "destroy" "-R" "-f" "--" "obuilder/cache-tmp"
Unmount successful for /Users/mac1000/.opam/download-cache
Unmount successful for /Users/mac1000/Library/Caches/Homebrew
Unmount successful for /Volumes/obuilder/cache-tmp
2024-08-16 18:15.22    application [INFO] Exec "chown" "0:0" "/Volumes/obuilder/state"
2024-08-16 18:15.22    application [INFO] Exec "chown" "0:0" "/Volumes/obuilder/result"
2024-08-16 18:15.22    application [INFO] Exec "chown" "0:0" "/Volumes/obuilder/cache"
cannot open 'obuilder/cache-tmp': dataset does not exist
2024-08-16 18:15.22    application [INFO] Exec "zfs" "create" "--" "obuilder/cache-tmp"
2024-08-16 18:15.22    application [INFO] Exec "chown" "0:0" "/Volumes/obuilder/cache-tmp"
2024-08-16 18:15.22         worker [INFO] Performing OBuilder self-test…
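The exit-and-restart behaviour described above amounts to retrying initialisation until ZFS is ready. As a rough sketch only (Python, not the worker's code), the same idea can be written as an explicit bounded wait. The helper names `wait_until_ready` and `pool_is_ready`, the attempt count, and the delay are assumptions made for this example. The probe uses `zfs list`, which exits with a non-zero status when the named dataset cannot be opened.

```python
import subprocess
import time
from typing import Callable


def pool_is_ready(pool: str = "obuilder") -> bool:
    """Probe (hypothetical helper): True when `zfs list` can open the pool."""
    try:
        proc = subprocess.run(["zfs", "list", "-H", "-o", "name", pool],
                              capture_output=True, text=True)
    except OSError:
        return False  # zfs binary missing or not executable
    return proc.returncode == 0


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 3,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

With something like this, the "not ready yet" case would show up as one explicit log message (for example when `wait_until_ready(pool_is_ready)` returns false) rather than as a series of internal errors across restarts.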

@shonfeder
Contributor Author

That is helpful! So this is probably a symptom and not the cause of the problems we are seeing. Still worth cleaning this up, imo, as it will make it easier to debug in the future if we don't have to rely on interpreting apparent evidence of errors as signs of normal operation :)

@shonfeder
Contributor Author

After reviewing the code structure, I see that adding proper error handling and reporting (by which I mean that "internal errors" are not surfaced for routine and predictable error cases) would either be ad hoc or require a non-trivial reorganization. Given that this is only symptomatic of the problem we are seeing on the macOS workers, and that I see no way that this will actually help us debug, I don't think it's something to prioritize at the moment.

@shonfeder shonfeder removed their assignment Sep 6, 2024