Add broken status #357

AllKind · 2023-10-23T14:11:31Z

If either a dkms module's source directory or the symbolic link pointing to it is missing, the output of dkms status gets garbled. Add a new status called 'broken', which informs the user about the problem in a cleanly formatted way.

A recently discovered error in the ZFS installation routine led to this issue: openzfs/zfs#15336

The dkms module sources were deleted, but the files in the dkms tree were still there.
Therefore the symbolic link 'source' was dangling.

This led to dkms status output like this:


dkms status
nvidia/525.125.06, 5.15.0-87-generic, x86_64: installed
nvidia/525.125.06, 5.15.135-custom, x86_64: installed
nvidia/525.125.06, 5.15.136-custom, x86_64: installedError! Could not locate dkms.conf file.
File: /var/lib/dkms/zfs/2.1.13/source/dkms.conf does not exist.

zenpower/0.1.12, 5.15.0-87-generic, x86_64: installed
zenpower/0.1.12, 5.15.135-custom, x86_64: installed
zenpower/0.1.12, 5.15.136-custom, x86_64: installed

dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/zfs/2.1.12/source/dkms.conf does not exist.

With this patch, the output will be like this:

dkms status 
zfs/2.1.13: broken
 - Missing module source directory, or the symbolic link pointing to it

Points that are IMHO open for discussion (of course only if the change gets accepted):
Whether the "Missing ..." message should be on the same line as "broken" or on a line of its own.
In the latter case it could be sent to stderr, which would have the advantage of not breaking existing scripts that parse the output of dkms status.

Just as an idea: in the future, other things could be checked for a healthy state, e.g. the installed modules. If found broken, they would be reported with this new state.
I haven't looked into the codebase deeply enough yet to propose something more concrete. But maybe I will. If I come up with something reasonable, I'll put it up for discussion.

Have a nice day!

evelikov · 2023-10-24T14:33:48Z

We just landed a PR to make the error fatal - see #354 and related discussion #352.

From what I can see in the zfs issue, it seems like the release/package was completely broken leading to the problem. Is that correct?

Overall I like the idea, although I am cautious about regressions for the XXXs of users out there parsing dkms status output. That said, such users will currently also choke when they see the example you posted. So I believe we should be safe on that front.

Some requests:

Add separate commit reverting my earlier PR
As you pointed out, the error line should really be sent to stderr
Please document the new status in the manual
Can you add some tests - especially for operations that feed off the (internal) dkms status

Glancing through the codebase: match, autoinstall, any non-status action with --all, and uninstall all use a variation of status. There are a bunch of tests to serve as examples for all but match, so don't worry about that one.

Thanks

AllKind · 2023-10-24T15:02:57Z

From what I can see in the zfs issue, it seems like the release/package was completely broken leading to the problem. Is that correct?

Various reasons from what I understand.

-: dkms status output changed at some point, which broke the package scriptlet.
-: zfs config.h not packaged properly, which also broke the package scriptlet.
-: The proposed fix used the RPM scriptlet hook %posttrans, which broke the debian packages.

All of these led to various problems when installing, re-installing, upgrading or removing.

Tomorrow I will be gone for 10 days.
After that I will look into the things you said.

evelikov · 2023-10-24T15:08:32Z

IIRC the dkms status did change a while back, so we did goof up on that front.

Thanks for the work unwrapping this. Enjoy your time AFK o/

evelikov · 2023-10-24T15:09:52Z

Fwiw at a later point, if you feel like adding some zfs tests into our CI that would be deeply appreciated, although it is entirely optional.

evelikov · 2023-11-29T11:39:09Z

@AllKind did you have the time to get back to this? Introducing a "broken" status is reasonable, although we'd need to update the documentation and add some test(s) as mentioned above. Thanks

AllKind · 2023-12-02T20:26:47Z

@evelikov
Sorry for the delay. I got sick and then got buried in other things.
It's still on my radar. I will try to find time in the next few days.

AllKind · 2023-12-03T14:09:47Z

@evelikov
I updated this PR according to your feedback.

1: I reverted the requested commit. It makes sense to not interrupt at dkms status, as other modules may be installed.

2: The broken status is printed to stdout like all the other states, plus an extra line on stderr that describes the problem.

3: I took a look at the internal functions. I added a check for the 'broken' status in do_autoinstall(). Also I modified is_module_added(), to only report a module as added, if both the source directory and the symlink pointing to it exist.
As far as I can see, that should cover it, as all other functions die at read_conf(). But maybe I missed something.
Review needed! ;-)

4: The man page was updated.

Regarding adding tests, I don't know what to do there. Some more details would be useful.
For now I just took a glance at run_test.sh ... lots to read there.
Anyhow, if I do that, it could be in a different PR.

evelikov · 2023-12-05T13:21:47Z

Welcome back o/

Thinking about the tests, here are some rough pseudo-code ideas:

non-status action that supports "--all" - aka build/install, remove, unbuild, uninstall

dkms add one-in-tree-demo-module X
dkms add another-in-tree-demo-module Y
"break" module Y
for module in X Y; dkms build --all $module
dkms status observe/match the broken module
for module in X Y; dkms remove --all $module

autoinstall

Copy/edit the existing autoinstall test to:

"break" one (or all if you prefer) modules
dkms autoinstall ... existing test sets -k which is undocumented and should not allow "--all" (couple of issues if you'd want to help)
dkms status observe/match the broken module and newly installed ones

match

No ideas ... if you can come up with any, that'll be great although optional.

I would request that we elaborate/define in the manual page how a broken module affects the different actions.
At a glance, it seems like we want to behave as-if the module isn't there for all actions but "remove". Where the latter will remove the broken module. If you think that behaviour should be different or user-controllable that also works for me.

NOTE: build --all and install --all are broken/disabled atm, so you can use -k/-a if we don't fix that soon (tm).

AllKind · 2023-12-06T21:27:13Z

I would request that we elaborate/define in the manual page how a broken module affects the different actions.
At a glance, it seems like we want to behave as-if the module isn't there for all actions but "remove". Where the latter will remove the broken module. If you think that behaviour should be different or user-controllable that also works for me.

Broken in this PR means that either the module source (and with it the dkms.conf) is missing, or the 'source' symlink pointing to it is missing.
If the module source is missing, add, build, install, etc. of course cannot work.
If only the symlink is missing, in theory one could re-create it. But to reach the broken state, things must have gone badly wrong.
So my opinion is that user intervention is needed. That is safer than automatically trying to fix something when there isn't really a way to tell what is wrong. An automatic fix also risks the user not noticing a problem, or the root problem being masked if we run into an error in a later step.

AllKind · 2023-12-06T21:33:08Z

One thing I wanted to point out is:
[[ -L $dkms_tree/$1/$2/source || -d $dkms_tree/$1/$2/source ]]; was the check in is_module_added().
This was flawed already. It's an OR condition (which allowed the broken state in the first place). I changed it to be an AND condition. That's also the reason the is_module_broken check needs to run before is_module_added.
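
A minimal sketch of why the OR form lets a dangling symlink through, using a throwaway directory in place of the real dkms tree. The function names is_added_or and is_added_and and the layout under $tree are made up for illustration; only the two file tests come from the comment above.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the old and new checks; not the dkms code itself.
is_added_or()  { [[ -L $1/source || -d $1/source ]]; }
is_added_and() { [[ -L $1/source && -d $1/source ]]; }

tree=$(mktemp -d)
mkdir -p "$tree/src" "$tree/mod/1.0"
ln -s "$tree/src" "$tree/mod/1.0/source"

# Healthy: the symlink exists and resolves to a directory.
is_added_or  "$tree/mod/1.0" && echo "healthy: OR says added"
is_added_and "$tree/mod/1.0" && echo "healthy: AND says added"

# Break it: delete the module source, leaving a dangling symlink.
rmdir "$tree/src"
is_added_or  "$tree/mod/1.0" && echo "dangling: OR still says added"
is_added_and "$tree/mod/1.0" || echo "dangling: AND says not added"

rm -rf "$tree"
```

-L is true for any symlink, dangling or not, while -d follows the link and fails when the target is gone. That is why only the AND form rejects the dangling case.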

evelikov · 2023-12-07T12:21:26Z

Broken in this PR means that either the module source (and with it the dkms.conf) is missing, or the 'source' symlink pointing to it is missing.
If the module source is missing, add, build, install, etc. of course cannot work.

That's my understanding as well.

If only the symlink is missing, in theory one could re-create it. But to reach the broken state, things must have gone badly wrong.
So my opinion is that user intervention is needed. That is safer than automatically trying to fix something when there isn't really a way to tell what is wrong. An automatic fix also risks the user not noticing a problem, or the root problem being masked if we run into an error in a later step.

Fully agreed, let's avoid silently fixing things.

evelikov · 2023-12-07T12:23:40Z

In case it wasn't obvious - it's fine to issue dkms build or install on a broken module. The test can track the error code and message - see run_with_expected_error.

AllKind · 2023-12-08T14:58:53Z

Just pushed a new version with a lot of changes. Description is in the updated commit message.
It's right now completely untested - I'll see if the automated tests turn up things.
No more time left today, I'll try to continue tomorrow.

I'd just like to confirm whether the changes go in the right direction, conceptually.

Also I have two questions:

1 - About the exit states on die() - Could not find any documentation in the source. What are the conventions?

2 - As I've never seen this before.... Why is every echo command prefixed with a $ sign? echo $"bla bla"

evelikov · 2023-12-08T15:40:38Z

It's right now completely untested - I'll see if the automated tests turn up things.

Seems like existing tests caught some breakage already. Hazzah for tests catching issues.

Cannot see any new tests, not sure if you meant to git add run-tests.sh just yet - it's fine if you don't. Although please git rm .dkms.in.swp for the next round.

I'd just like to confirm whether the changes go in the right direction, conceptually.

Instead of looking at the fixes/code alone, can I suggest opting for another route - TDD:

define the expectations - what should fail and when, what should succeed
write some tests that enforce/validate that behaviour
churn through the code until the tests are happy

1 - About the exit states on die() - Could not find any documentation in the source. What are the conventions?

As a general rule - code should use die and never exit. The err codes are somewhat arbitrarily picked - the only ones that truly matter are 0 (success) and 77 (skip).
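
A rough sketch of that convention, not the actual dkms code. The helper names die_sketch and classify are hypothetical, and the exit code 5 is picked arbitrarily; only the meanings of 0 and 77 are taken from the reply above.

```bash
#!/usr/bin/env bash
# Hypothetical helper in the spirit of die(): print a message, exit with a code.
die_sketch() {
    local ret=$1
    shift
    echo "Error! $*" >&2
    exit "$ret"
}

# Map an exit code to the two meanings named above; anything else is a failure.
classify() {
    case $1 in
        0)  echo "success" ;;
        77) echo "skip" ;;
        *)  echo "failure ($1)" ;;
    esac
}

# Run each case in a subshell so that exit does not end this script.
rc=0; ( exit 0 ) || rc=$?
classify "$rc"

rc=0; ( die_sketch 77 "nothing to do here" ) 2>/dev/null || rc=$?
classify "$rc"

rc=0; ( die_sketch 5 "module is broken" ) 2>/dev/null || rc=$?
classify "$rc"
```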

2 - As I've never seen this before.... Why is every echo command prefixed with a $ sign? echo $"bla bla"

It's a bashism - see https://unix.stackexchange.com/questions/48106/ for more. In a practical sense - there are only a handful of cases (say 5%?) where they're needed ... even though we use it ~40% of the time. Don't worry, pick whichever you're happy with.
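
A small sketch of what the $"..." form does. It marks the string for translation through the current locale's message catalog. With no catalog available (assumed here, with TEXTDOMAIN unset) bash treats it like an ordinary double-quoted string, so both forms print the same text. The variable names are illustrative.

```bash
#!/usr/bin/env bash
# No message catalog is configured, so no translation can be found.
unset TEXTDOMAIN TEXTDOMAINDIR
module="dkms_test/1.0"

plain="Module $module is broken"
localized=$"Module $module is broken"

echo "$plain"
echo "$localized"
[[ $plain == "$localized" ]] && echo "identical without a translation catalog"
```

Variable expansion still happens inside $"..." exactly as it does inside plain double quotes.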

AllKind · 2023-12-08T15:54:27Z

Seems like existing tests caught some breakage already. Hazzah for tests catching issues.

Edit: I meant the existing tests here on github.

Yeah, saw that, quickly did a fix. Will think about it more tomorrow.

Cannot see any new tests, not sure if you meant to git add run-tests.sh just yet - it's fine if you don't. Although please git rm .dkms.in.swp for the next round.

I did not add any tests yet. I wanted to first check in with you guys, if my changes go a way you are comfortable with.
Yeah, the .swp slipped in by mistake. Maybe add *.swp to .gitignore?
Also after make git wants to track these:
dkms.service
dkms_autoinstaller
kernel_install.d_dkms
kernel_postinst.d_dkms

Also a case for .gitignore?

evelikov · 2023-12-08T16:02:22Z

I did not add any tests yet. I wanted to first check in with you guys, if my changes go a way you are comfortable with.

Hard to reply here, since the very first part is missing - "define the expectations". The commit message explains what the code does, instead of why. As a whole it doesn't seem to be doing crazy things.

Will open a PR in a second to update .gitignore. Thanks o/

AllKind · 2023-12-08T16:11:32Z

I thought we talked about the "expectations"...
At first my intention was just to add the "broken" status to dkms status only.

You then asked me to expand that to the other actions / functionality.
So now I added that to every action. This means every action now checks if the module/version is in a broken state.
That's now the expectation....

As said I wanted to check with you, if the way I do it, is ok for you.
Including the messages printed.
Once the functionality / code is accepted by you guys, I'll go ahead and write new tests, specifically for the "broken" state.

Cool?


evelikov · 2023-12-08T16:41:35Z

Grr - didn't see the updated manual page, sorry my bad. It covers things afaict.

Left a few specific comments but overall the work is fine

AllKind · 2023-12-10T18:58:45Z

So...
While testing I found an error in is_module_broken(). The second test - symlink is missing, but module source exists - was bad.
Changed that to: [[ ! -L $dkms_tree/$1/$2/source && -d $source_tree/$1-$2/ ]] && return - now using the $source_tree variable.
Which should work from reading the source code and it did work while testing.
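
A sketch of how such a check could behave, run against temporary directories standing in for the dkms tree and the source tree. Only the second test is taken from the comment above. The first test (a dangling 'source' symlink) is an assumption about what the real function checks, and the names is_module_broken_sketch and report are hypothetical.

```bash
#!/usr/bin/env bash
dkms_tree=$(mktemp -d)     # stands in for the dkms tree
source_tree=$(mktemp -d)   # stands in for the module source location
trap 'rm -rf "$dkms_tree" "$source_tree"' EXIT

is_module_broken_sketch() {
    # Assumed first test: the 'source' symlink exists but dangles.
    [[ -L $dkms_tree/$1/$2/source && ! -d $dkms_tree/$1/$2/source ]] && return 0
    # Second test, as quoted above: symlink missing, module source present.
    [[ ! -L $dkms_tree/$1/$2/source && -d $source_tree/$1-$2/ ]] && return 0
    return 1
}

report() {
    if is_module_broken_sketch "$1" "$2"; then
        echo "$1/$2: broken"
    else
        echo "$1/$2: ok"
    fi
}

# Healthy module: symlink present and pointing at an existing source directory.
mkdir -p "$dkms_tree/demo/1.0" "$source_tree/demo-1.0"
ln -s "$source_tree/demo-1.0" "$dkms_tree/demo/1.0/source"
report demo 1.0

# Symlink removed, module source still there.
rm "$dkms_tree/demo/1.0/source"
report demo 1.0

# Symlink restored, module source deleted.
ln -s "$source_tree/demo-1.0" "$dkms_tree/demo/1.0/source"
rmdir "$source_tree/demo-1.0"
report demo 1.0
```

The healthy case reports ok and both damaged cases report broken.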

I added the tests for the 'broken' status to the best of my knowledge (nothing for the match action - no existing tests to copy from), but the code is the same as in autoinstall()... so...

I'd say this PR is ready for merge.
The github tests fail for a different reason AFAICT. Running the test suite locally succeeds.


AllKind · 2023-12-14T11:00:33Z

Adjusted the tests to the new output...

Now this is odd:
[Build in containers (alpine, 3.16, -virt)]
https://github.com/dell/dkms/actions/runs/7207768504/job/19635277841?pr=357

Running BROKEN tests

Adding the test module by directory
 Removing symlink /var/lib/dkms/dkms_test/1.0/source
Checking broken status
--- test_cmd_expected_output.log
Error: unexpected output from: dkms_status_grep_dkms_module for dkms_test
+++ test_cmd_output.log
@@ -1,3 +1,3 @@
+dkms_test/1.0: broken
 Error! dkms_test/1.0: Missing the module source directory or the symbolic link pointing to it.
 Manual intervention is required!
-dkms_test/1.0: broken

Same test passes in the Ubuntu VMs.

I do the testing in a Virtualbox VM with Fedora 38.
There I noticed that run_status_with_expected_output reverses the order of stdout and stderr in test_cmd_output.log.
So I wrote the expected output like that.
But looks like in [Build in containers (alpine, 3.16, -virt)] it's actually in the right order... hrmm
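
One way to make such a check independent of how the two streams interleave is to capture them separately. A minimal sketch: fake_status is a hypothetical stand-in for the real command, and the messages are copied from the log above.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in: status line on stdout, diagnostics on stderr.
fake_status() {
    echo "dkms_test/1.0: broken"
    echo "Error! dkms_test/1.0: Missing the module source directory or the symbolic link pointing to it." >&2
    echo "Manual intervention is required!" >&2
}

out_file=$(mktemp)
err_file=$(mktemp)
trap 'rm -f "$out_file" "$err_file"' EXIT

# Each stream goes to its own file, so their relative order cannot matter.
fake_status >"$out_file" 2>"$err_file"

grep -qx "dkms_test/1.0: broken" "$out_file" && echo "stdout matches"
grep -q "Manual intervention is required!" "$err_file" && echo "stderr matches"
```

When both streams are merged into one log, the order depends on buffering and on how the redirections are set up, which may explain the difference between environments.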

If either a dkms module source or the symbolic link pointing to it is missing, the output of `dkms status` will be messed up. Add a new status called 'broken', which will inform the user about it in a nicely formatted way.

is_module_added() was modified to report a module as added only if both the source directory and the 'source' symlink exist.

do_autoinstall() and run_match() were modified to handle a broken status. They skip that particular module/version combo and continue iterating.

The new function module_is_broken_and_die() was introduced to die early on a broken module. Because everything has to be considered volatile in a broken state, we always die. User intervention is required to restore a healthy environment. The only exception is that, if only the symbolic link 'source' is missing, the action 'add' can be used to re-add the module.

The man page was updated with the new 'broken' status. Tests were added to the test suite.

Signed-off-by: Mart Frauenlob <[email protected]>

AllKind · 2023-12-14T12:17:58Z

Ok, worked around that by not using the run_status_with_expected_output() function.
Took me a while to figure it out... but fixed the tests not passing on alpine linux, by using ${KERNEL_VER}.

AllKind · 2023-12-19T22:20:14Z

Silent around here lately...

AllKind · 2024-01-31T11:43:46Z

Any word on this?

evelikov · 2024-01-31T12:57:00Z

Ouch, sorry for letting this slip through the cracks.

Looking through it looks solid - the tests are more extensive than I would have gone for. Thank you.

The easily annoyed user in me would have preferred "dkms remove the/broken/module" (and by extension unbuild/uninstall) to work, although I'm not 100% sure that's a good idea.

In the worst case, we can worry about it if people complain.

Thanks again for the work (and prodding me) 👍

evelikov · 2024-01-31T13:12:59Z

Resolved the merge conflict, but could not push back to your branch. Did you have Allow edits and access to secrets by maintainers ticked or was it something broken on my end?

In either case, I've resolved the issues and pushed this PR via the CLI. Closing

AllKind mentioned this pull request Oct 24, 2023

Fix dkms installation of deb packages created with Alien. openzfs/zfs#15415

Merged

AllKind force-pushed the add_broken_status branch from 0c6663f to 935e7af Compare December 3, 2023 13:35

AllKind force-pushed the add_broken_status branch from 935e7af to e4af22f Compare December 8, 2023 14:39

AllKind force-pushed the add_broken_status branch from e4af22f to 153457c Compare December 8, 2023 15:48

evelikov reviewed Dec 8, 2023

AllKind force-pushed the add_broken_status branch 4 times, most recently from ce23f98 to 8565169 Compare December 10, 2023 18:38

AllKind force-pushed the add_broken_status branch from 8565169 to e81f165 Compare December 11, 2023 10:10

Revert "dkms: always use read_conf_or_die"

dfbbef6

This reverts commit c0004f0. Signed-off-by: Mart Frauenlob <[email protected]>

AllKind force-pushed the add_broken_status branch from e81f165 to 939cfec Compare December 14, 2023 10:44

AllKind force-pushed the add_broken_status branch 3 times, most recently from 4b53a1d to 27fc015 Compare December 14, 2023 12:02

AllKind force-pushed the add_broken_status branch from 27fc015 to 9d6c9ab Compare December 14, 2023 12:08

evelikov closed this Jan 31, 2024
