Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add broken status #357

Closed
wants to merge 2 commits into from
Closed

Add broken status #357

wants to merge 2 commits into from

Conversation

AllKind
Copy link
Contributor

@AllKind AllKind commented Oct 23, 2023

If either a dkms module source, or the symbolic link pointing to it is missing, the output of dkms status will be messed up. Add a new status called 'broken', which will inform the user about it in a nicely formatted way.

A recently discovered error in the ZFS installation routine lead to this issue: openzfs/zfs#15336

The dkms module sources were deleted, but the files in the dkms tree were still there.
Therefor the symbolic link 'source' was dangling.

This led to dkms status output like this:


dkms status
nvidia/525.125.06, 5.15.0-87-generic, x86_64: installed
nvidia/525.125.06, 5.15.135-custom, x86_64: installed
nvidia/525.125.06, 5.15.136-custom, x86_64: installedError! Could not locate dkms.conf file.
File: /var/lib/dkms/zfs/2.1.13/source/dkms.conf does not exist.

zenpower/0.1.12, 5.15.0-87-generic, x86_64: installed
zenpower/0.1.12, 5.15.135-custom, x86_64: installed
zenpower/0.1.12, 5.15.136-custom, x86_64: installed
dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/zfs/2.1.12/source/dkms.conf does not exist.

With this patch, the output will be like this:

dkms status 
zfs/2.1.13: broken
 - Missing module source directory, or the symbolic link pointing to it

Things, which are IMHO for discussion are (of course only in case the change gets accepted):
If the "Missing ..." message should be at the same line as "broken", or on a line on its own.
For the latter possibly with sent to stderr. Which would have the advantage eventually not to break possibly existing scripts, which parse the output of dkms status.

Just as an idea... In the future other things may be checked for a healthy state. I.e. the installed modules. If found broken, reported as such with this new state.
I didn't look into the codebase more deeply yet, to propose something more concrete. But maybe I will. If I come up with something reasonable, I'll put it up for discussion.

Have a nice day!

@evelikov
Copy link
Collaborator

We just landed a PR to make the error fatal - see #354 and related discussion #352.

From what I can see in the zfs issue, it seems like the release/package was completely broken leading to the problem. Is that correct?

Overall I like the idea, although am cautious for regressions in the XXXs users out there parsing dkms status output. That said, currently such users will also choke as they see the example you posted. So I believe we should be safe on that front.

Some requests:

  • Add separate commit reverting my earlier PR
  • As you pointed out, the error line should really be send to stderr
  • Please document the new status in the manual
  • Can you add some tests - especially for operations that feed off the (internal) dkms status

Glancing through the codebase - match, autoinstall, any non status action with --all, uninstall all use a variation of status. There are bunch of tests to serve as examples for all but match, so don't worry about that one.

Thanks

@AllKind
Copy link
Contributor Author

AllKind commented Oct 24, 2023

From what I can see in the zfs issue, it seems like the release/package was completely broken leading to the problem. Is that correct?

Various reasons from what I understand.

-: dkms status output changed at some point, which broke the package scriptlet.
-: zfs config.h not packaged properly, which also broke the package scriptlet.
-: The proposed fix used the RPM scriptlet hook %posttrans, which broke the debian packages.

All lead to various problems, when installing, re-installing, upgrading or removing.


Tomorrow I will be gone for 10 days.
After that I will look into the things you said.

@evelikov
Copy link
Collaborator

IIRC the dkms status did change a while back, so we did good up on that front.

Thanks for the work unwrapping this. Enjoy you time AFK o/

@evelikov
Copy link
Collaborator

Fwiw at a later point, if you feel like adding some zfs tests into our CI that would be deeply appreciated, although is entirely optional.

@evelikov
Copy link
Collaborator

@AllKind did you have the time to get back to this? Introducing a "broken" status is reasonable, although we'd need to update the documentation and add some test(s) as mentioned above. Thanks

@AllKind
Copy link
Contributor Author

AllKind commented Dec 2, 2023

@evelikov
Sorry for the delay. I got sick and then got buried in other things.
It's still on my radar. I will try to find time the next days.

@AllKind
Copy link
Contributor Author

AllKind commented Dec 3, 2023

@evelikov
I updated this PR according to your feedback.

1: I reverted the requested commit. It makes sense to not interrupt at dkms status, as other modules may be installed.

2: The broken status is displayed to the user to stdout as all the other states. Plus an extra line to stderr, which describes the problem.

3: I took a look at the internal functions. I added a check for the 'broken' status in do_autoinstall(). Also I modified is_module_added(), to only report a module as added, if both the source directory and the symlink pointing to it exist.
As far as I can see, that should cover it, as all other functions die at read_conf(). But maybe I missed something.
Review needed! ;-)

4: The man page was updated.

Regarding adding tests, I don't know what to do there. Some more details would be useful.
For now I just took a glance at run_test.sh ... lot's to read there.
Anyhow, if I do that, it could be in a different PR.

@evelikov
Copy link
Collaborator

evelikov commented Dec 5, 2023

Welcome back o/

Thinking about the tests, here are some rough pseudo-code ideas:

  • non-status action that supports "--all" - aka build/install, remove, unbuild, uninstall
  1. dkms add one-in-tree-demo-module X
  2. dkms add another-in-tree-demo-module Y
  3. "break" module Y
  4. for module in X Y; dkms build --all $module
  5. dkms status observe/match the broken module
  6. for module in X Y dkms remove --all $module
  • autoinstall

Copy/edit the existing autoinstall test to:

  1. "break" one (or all if you prefer) modules
  2. dkms autoinstall ... existing test sets -k which is undocumented and should not allow "--all" (couple of issues if you'd want to help)
  3. dkms status observe/match the broken module and newly installed ones
  • match

No ideas ... if you can come up with any, that'll be great although optional.

I would *request that we elaborate/define in the manual page how a broken module affects the different actions.
At a glance, it seems like we want to behave as-if the module isn't there for all actions but "remove". Where the latter will remove the broken module. If you think that behaviour should be different or user-controllable that also works for me.

NOTE: build --all and install --all are broken/disabled atm, so you can use -k/-a if we don't fix that soon (tm).

@AllKind
Copy link
Contributor Author

AllKind commented Dec 6, 2023

I would *request that we elaborate/define in the manual page how a broken module affects the different actions.
At a glance, it seems like we want to behave as-if the module isn't there for all actions but "remove". Where the latter will remove the broken module. If you think that behaviour should be different or user-controllable that also works for me.

Broken in this PR means, that either the module source (and the dkms.conf) is missing, or the symlink 'source' pointing to it.
If the module source is missing, add, build, install, etc. of course cannot work.
If only the symlink is missing, in theory one could re-create it. But to reach the broken state things must have gone bad.
So my opinion is, user intervention is needed. And safer than to automatically try to fix something, where there isn't really a way to tell what is wrong. Which also introduces the risk that the user is not noticing a problem, or masking the root problem, if in a later step we run into an error.

@AllKind
Copy link
Contributor Author

AllKind commented Dec 6, 2023

One thing I wanted to point out is:
[[ -L $dkms_tree/$1/$2/source || -d $dkms_tree/$1/$2/source ]]; was the check in is_module_added().
This was flawed already. It's an OR condition (which allowed the broken state in the first place). I changed it to be an AND condition. That's also the reason the is_module_broken check needs to run before is_module_added.

@evelikov
Copy link
Collaborator

evelikov commented Dec 7, 2023

Broken in this PR means, that either the module source (and the dkms.conf) is missing, or the symlink 'source' pointing to it.
If the module source is missing, add, build, install, etc. of course cannot work.

That's my understanding as well.

If only the symlink is missing, in theory one could re-create it. But to reach the broken state things must have gone bad.
So my opinion is, user intervention is needed. And safer than to automatically try to fix something, where there isn't really a way to tell what is wrong. Which also introduces the risk that the user is not noticing a problem, or masking the root problem, if in a later step we run into an error.

Fully agreed, let's avoid silently fixing things.

@evelikov
Copy link
Collaborator

evelikov commented Dec 7, 2023

In case it wasn't obvious - it's fine to issue dkms build or install on a broken module. The test can track the error code and message - see run_with_expected_error.

@AllKind
Copy link
Contributor Author

AllKind commented Dec 8, 2023

Just pushed a new version with a lot of changes. Description is in the updated commit message.
It's right now completely untested - I'll see if the automated tests turn up things.
No more time left today, I'll try to continue tomorrow.

I'd just like to confirm if the changes go into the right direction - conceptual wise.

Also I have two questions:

1 - About the exit states on die() - Could not find any documentation in the source. What's the conventions?

2 - As I've never seen this before.... Why is every echo command prefixed with a $ sign? echo $"bla bla"

@evelikov
Copy link
Collaborator

evelikov commented Dec 8, 2023

It's right now completely untested - I'll see if the automated tests turn up things.

Seems like existing tests caught some breakage already. Hazzah for tests catching issues.

Cannot see any new tests, not sure if you meant to git add run-tests.sh just yet - if fine if you don't. Although please git rm .dkms.in.swp for the next round.

I'd just like to confirm if the changes go into the right direction - conceptual wise.

Instead of looking at the fixes/code alone, can I suggest opting for another route - TDD:

  • define the expectations - what should fail and when, what should succeed
  • write some tests that enforce/validate that behaviour
  • churn through the code until the tests are happy

1 - About the exit states on die() - Could not find any documentation in the source. What's the conventions?

As general rule - code should use die and never exit. The err codes are somewhat arbitrarily picked - the only ones that truly matter are 0 (success) and 77 (skip).

2 - As I've never seen this before.... Why is every echo command prefixed with a $ sign? echo $"bla bla"

It's a bashism - see https://unix.stackexchange.com/questions/48106/ for more. In practical sense - there are only a handful of cases (say 5%?) where they're needed ... even though we use it ~40% of the time. Don't worry pick whichever you're happy with.

@AllKind
Copy link
Contributor Author

AllKind commented Dec 8, 2023

Seems like existing tests caught some breakage already. Hazzah for tests catching issues.

Edit: I meant the existing tests here on github.

Yeah, saw that, quickly did a fix. Will think about it more tomorrow.

Cannot see any new tests, not sure if you meant to git add run-tests.sh just yet - if fine if you don't. Although please git rm .dkms.in.swp for the next round.

I did not add any tests yet. I wanted to first check in with you guys, if my changes go a way you are comfortable with.
Yeah, the .swp slipped in on error. Maybe add *.swp to .gitignore?
Also after make git wants to track these:
dkms.service
dkms_autoinstaller
kernel_install.d_dkms
kernel_postinst.d_dkms

Also a case for .gitignore?

@evelikov
Copy link
Collaborator

evelikov commented Dec 8, 2023

I did not add any tests yet. I wanted to first check in with you guys, if my changes go a way you are comfortable with.

Hard to reply here, since the very first part is missing - "define the expectations". The commit message explains what the code does, instead of why. As a whole it doesn't seem to be doing crazy things.

Will open a PR in a second to update .gitignore. Thanks o/

@AllKind
Copy link
Contributor Author

AllKind commented Dec 8, 2023

I thought we talked about the "expectations"...
At first my intention was just to add the "broken" status to only zfs status.

You then asked me to expand that to the other actions / functionality.
So now I added that to every action. Means every action now checks if the module/version is in a broken state.
That's now the expectation....

As said I wanted to check with you, if the way I do it, is ok for you.
Including the messages printed.
Once the functionality / code is accepted by you guys, I'll go ahead and write new tests, specifically for the "broken" state.

Cool?

dkms.in Outdated Show resolved Hide resolved
dkms.in Outdated Show resolved Hide resolved
dkms.in Outdated Show resolved Hide resolved
@evelikov
Copy link
Collaborator

evelikov commented Dec 8, 2023

Grr - didn't see the updated manual page, sorry my bad. It covers things afaict.

Left a few specific comments but overall the work is fine

@AllKind AllKind force-pushed the add_broken_status branch 4 times, most recently from ce23f98 to 8565169 Compare December 10, 2023 18:38
@AllKind
Copy link
Contributor Author

AllKind commented Dec 10, 2023

So...
While testing I found an error in is_module_broken(). The second test - symlink is missing, but module source exists - was bad.
Changed that to: [[ ! -L $dkms_tree/$1/$2/source && -d $source_tree/$1-$2/ ]] && return - now using the $source_tree variable.
Which should work from reading the source code and it did work while testing.

I added the tests for the 'broken' status to the best of my knowledge (nothing for the match action - no existing tests to copy from), but the code is the same as in autoinstall()... so...

I'd say this PR is ready for merge.
The github tests fail for a different reason AFAICT. Running the test suite locally succeeds.

This reverts commit c0004f0.

Signed-off-by: Mart Frauenlob <[email protected]>
@AllKind
Copy link
Contributor Author

AllKind commented Dec 14, 2023

Adjusted the tests to the new output...

Now this is odd:
[Build in containers (alpine, 3.16, -virt)]
https://github.com/dell/dkms/actions/runs/7207768504/job/19635277841?pr=357

Running BROKEN tests

Adding the test module by directory
 Removing symlink /var/lib/dkms/dkms_test/1.0/source
Checking broken status
--- test_cmd_expected_output.log
Error: unexpected output from: dkms_status_grep_dkms_module for dkms_test
+++ test_cmd_output.log
@@ -1,3 +1,3 @@
+dkms_test/1.0: broken
 Error! dkms_test/1.0: Missing the module source directory or the symbolic link pointing to it.
 Manual intervention is required!
-dkms_test/1.0: broken

Same test passes in the Ubuntu VMs.

I do the testing in a Virtualbox VM with Fedora 38.
There I noticed that run_status_with_expected_output reverses the order of stdout and stderr in test_cmd_output.log.
So I wrote the expected output like that.
But looks like in [Build in containers (alpine, 3.16, -virt)] it's actually in the right order... hrmm

@AllKind AllKind force-pushed the add_broken_status branch 3 times, most recently from 4b53a1d to 27fc015 Compare December 14, 2023 12:02
If either a dkms module source, or the symbolic link pointing to
it is missing, the output of `dkms status` will be messed up.
Add a new status called 'broken', which will inform the user
about it in a nicely formatted way.
is_module_added() was modified to not report a module as added,
if not both the source directory and the 'source' symlink exist.
do_autoinstall() and run_match() were modified to handle a broken
status. They skip that particular module/version combo and continue
iterating.
The new function module_is_broken_and_die() was introduced to die
early on a broken module.
Because if in a broken state everything has to be considered volatile,
we always die. User intvervention is required to restore a healthy
environment.
The only exeption is, if only the symbolic link 'source' is missing,
the action 'add' can be used to re-add the module.
The man page was updated with the new 'broken' status.
Tests were added to the test suite.

Signed-off-by: Mart Frauenlob <[email protected]>
@AllKind
Copy link
Contributor Author

AllKind commented Dec 14, 2023

Ok, worked around that by not using the run_status_with_expected_output() function.
Took me a while to figure it out... but fixed the tests not passing on alpine linux, by using ${KERNEL_VER}.

@AllKind
Copy link
Contributor Author

AllKind commented Dec 19, 2023

Silent around here lately...

@AllKind
Copy link
Contributor Author

AllKind commented Jan 31, 2024

Any word on this?

@evelikov
Copy link
Collaborator

Ouch, sorry for letting this slip through the cracks.

Looking through it looks solid - the tests are more extensive than I would have gone for. Thank you.

The easily annoyed user in me would have preferred if "dkms remove the/broken/module" (and by extension unbuild/uninstall) to work, although I'm not !00% sure if that's a good idea.

In the worst case, we can worry about it if people complain.

Thanks again for the work (and prodding me) 👍

@evelikov
Copy link
Collaborator

Resolved the merge conflict, but could not push back to your branch. Did you have Allow edits and access to secrets by maintainers ticked or was it something broken on my end?

In either case, I've resolved the issues and pushed this PR via the CLI. Closing

@evelikov evelikov closed this Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants