Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debug_overlay.conf: temporarily disable SOF_BOOT_TEST #8624

Merged
merged 1 commit into from
Dec 14, 2023

Conversation

marc-hb
Copy link
Collaborator

@marc-hb marc-hb commented Dec 13, 2023

  • This test has failed on MTL since it was enabled. Running tests that
    we systematically ignore the failure of is much worse than not running
    them because it provides a false impression of quality.

  • This can cause DSP panics as seen in
    dai_zephyr: Silence benign warnings #8621

Signed-off-by: Marc Herbert [email protected]

- This test has failed on MTL since it was enabled.  Running tests that
  we systematically ignore the failure of is much worse than not running
  them because it provides a false impression of quality.

- This can cause DSP panics as seen in
  thesofproject#8621

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 14, 2023

No DSP panic in https://sof-ci.01.org/sofpr/PR8624/build1103/devicetest/index.html, even WITH #8621 commit on top.

So CONFIG_SOF_BOOT_TEST really is the smoking gun.

PS: cavs https://sof-ci.01.org/sofpr/PR8624/build1104/devicetest/index.html is also all green and so is https://sof-ci.01.org/sofpr/PR8624/build1102/build

zmain/LNL failure
https://github.com/thesofproject/sof/actions/runs/7203238661/job/19622784805?pr=8624 is known regression zephyrproject-rtos/zephyr#66042


EDIT: 2nd and final test run WITHOUT #8621:

@marc-hb marc-hb marked this pull request as ready for review December 14, 2023 01:29
@marc-hb marc-hb requested a review from lyakh December 14, 2023 01:29
@marc-hb marc-hb requested a review from andyross December 14, 2023 01:33
@marc-hb marc-hb added bug Something isn't working as expected P1 Blocker bugs or important features labels Dec 14, 2023
Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against disabling it. If the test finds a non-working subsystem, we need to fix or remove that subsystem, not disable the whole test framework.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 14, 2023

I'm against disabling it.

This is temporary and it's a one-line change, pure configuration change. It's not removing any code and preventing anyone from using it, testing it and fixing it.

If the test finds a non-working subsystem,

This test does NOT find anything and will not find anything: it failed every single time on MTL since the very beginning yet we noticed only now (which cost @andyross and myself at least half a day). It's yet another "green failure". That's not a problem with the test itself (it's because we've always ignored FW errors thesofproject/sof-test#1075) but the net effect is the same.

we need to fix or remove that subsystem, not disable the whole test framework.

  1. When can you provide a fix? This is blocking other work.
  2. It will still not find much because we still ignore FW errors Catch firmware errors sof-test#1075

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point @lyakh but given this is the only test, this seems reasonable. If we had a test that would pass, we could only disable one test, but that's not an option now.

@lyakh
Copy link
Collaborator

lyakh commented Dec 14, 2023

I'm against disabling it.

This is temporary and it's a one-line change, pure configuration change. It's not removing any code and preventing anyone from using it, testing it and fixing it.

If we remove those (purposefully annoying) error messages from everybody's sight, the underlying problems will never get fixed.

If the test finds a non-working subsystem,

This test does NOT find anything and will not find anything: it failed every single time on MTL since the very beginning yet we noticed only now (which cost @andyross and myself at least half a day). It's yet another "green failure". That's not a problem with the test itself (it's because we've always ignored FW errors thesofproject/sof-test#1075) but the net effect is the same.

Sorry, what do you mean "it doesn't find anything?" I think it's a valid test and it uncovers a real problem in SOF. And we knew about the failure from the moment that PR was submitted, it was just invisible in CI because the VMH that's failing the test is only available on MTL and MTL logs aren't yet publicly available in CI.

we need to fix or remove that subsystem, not disable the whole test framework.

1. When can you provide a fix? This is blocking other work.

You mean you're working on a CI test that would evaluate firmware logs? I could provide a "fix," removing the part of the VMH functionality that's failing, if that's desired.

2. It will still not find much because we still ignore FW errors [Catch firmware errors sof-test#1075](https://github.com/thesofproject/sof-test/pull/1075)

Originally this boot-time self-test was designed as an initially non-fatal advisory feature - until all found flaws are fixed.

@kv2019i
Copy link
Collaborator

kv2019i commented Dec 14, 2023

@lyakh We need to disable the failing boot test ASAP. The failing test has not caused side-effects before, but now it clearly has, so we can't afford to have it enabled in CI. Please ack this or submit an alternative. And can we make sure there's a bug filed for the failing case, so we can get proper priority to this.

EDIT:

@wszypelt
Copy link

@marc-hb Intel Internal CI tests crashed due to a problem with the verifier, we are in the process of repairing it, and the PR itself should pass in about 40 minutes

@kv2019i kv2019i merged commit c0a8881 into thesofproject:main Dec 14, 2023
41 of 44 checks passed
@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 14, 2023

it was just invisible in CI because the VMH that's failing the test is only available on MTL and MTL logs aren't yet publicly available in CI.

That's not why it's invisible. It's invisible because no one ever clicks on a green "PASS". There are literally hundreds of PASS/FAIL results for every test run so of course people never look at "PASS" results. That's humanly impossible.

as an initially non-fatal advisory feature

There is no such thing as "advisory" because no one looks at PASS logs.
In fact, after some (fairly short!) time people get used to even FAIL results and stop clicking on them too - which did just happen here with the i915 GSC firmware issue (just fixed) and almost got me to (again) miss the new DSP panic!

You mean you're working on a CI test that would evaluate firmware logs?

Yes I have been (even though green failures were never a priority) but now I'm blocked because there are countless (and hopefully harmless) repetition of someFW errors that no one ever noticed - cause no one opens PASS results. Details in thesofproject/sof-test#1075

@marc-hb marc-hb deleted the disable-boot-test branch December 14, 2023 17:15
@marc-hb
Copy link
Collaborator Author

marc-hb commented Dec 14, 2023

Yes I have been (even though green failures were never a priority)...

In 2020 someone tried to convince me that a test run accidentally missing logs should still be a PASS...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected P1 Blocker bugs or important features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants