[BUG] DSP panic with Zephyr on Intel MTL, regression 27th June #9268
Comments
@ceolin any ideas? |
What does the assembly dump look like for pm_state_set and power_down? |
One thing to clean up in power_down assembly is window exceptions. You're called in a context where register windowing is enabled, which means that any access to registers other than A0-A3 can in principle trap. And the first use of A11+ (the last window, after which no window exceptions will occur) doesn't happen until after you've locked three data and four instruction lines into the cache. I don't see why that's illegal, but if I wanted to place bets on "how to exercise weird core behavior", this would be on the list. Strongly suggest "pre-spilling" all registers first. Also, I note there are two "MOVI" pseudo-instructions after you start locking cache lines which are going to end up pointing into an arbitrary literals location. You need an explicit L32R to be sure it lands in valid memory, not a compiler-generated MOVI. |
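For illustration, a minimal C sketch of the pre-spill idea, assuming the caller can run C before entering the assembly routine. The wrapper name and the exact power_down() signature are hypothetical here, but xthal_window_spill() is the stock Xtensa HAL call for forcing all live register windows out to the stack:

```c
#include <stdbool.h>
#include <stdint.h>
#include <xtensa/hal.h>

/* Hypothetical, abbreviated signature for the assembly entry point. */
extern void power_down(bool disable_lpsram, uint32_t *hpsram_mask,
		       bool response_to_ipc);

static void power_down_prespilled(bool disable_lpsram, uint32_t *hpsram_mask,
				  bool response_to_ipc)
{
	/* Spill every live register window to the stack now, while memory
	 * and caches are fully operational, so no window-overflow exception
	 * can fire in the middle of the cache-locking sequence. */
	xthal_window_spill();

	power_down(disable_lpsram, hpsram_mask, response_to_ipc);
}
```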
Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds. Someone needs to comb through these files and make sure there's no "MOVI" usage (which again to be clear: is only a CPU instruction for small values, for big ones the compiler gets fancy and emits a .literals record for the linker to find and place). I mean, maybe I'll lose the bet and these will turn out to all be valid/non-loading variants. But still I think style would demand this be cleaned up. |
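To make the MOVI point concrete, a small sketch (constants are arbitrary): Xtensa GCC accepts both forms below, but only the first survives as an actual MOVI instruction; the second is silently relaxed by the assembler into an L32R against a linker-placed .literal entry, which is exactly the load that can land in powered-down memory:

```c
#include <stdint.h>

static inline uint32_t movi_small(void)
{
	uint32_t v;

	/* Immediate fits MOVI's signed 12-bit range: stays a real MOVI. */
	__asm__ volatile("movi %0, 42" : "=a"(v));
	return v;
}

static inline uint32_t movi_large(void)
{
	uint32_t v;

	/* Out of range: the assembler emits a .literal entry plus an L32R,
	 * and the linker decides where that literal word lives. */
	__asm__ volatile("movi %0, 0x12345678" : "=a"(v));
	return v;
}
```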
@andyross ouch, yes, that sounds like a problem. Thanks for finding it! Now somebody just needs to fix it... |
@andyross I looked again at these. And TBH I don't see a problem with those specific ones. Here are the lines we're talking about:
These ones - yes, agree, look dangerous... |
@andyross @teburd that was a nice theory, but it doesn't seem to help: zephyrproject-rtos/zephyr#75174 doesn't fix the problem. What exactly did we bet, Andy? ;-) |
Bah. I was sure I had it. Is the panic you're seeing on the DPFL instruction itself? Reading the ISA ref, that's only supposed to happen if:
Possibility 2 seems not... entirely impossible? This gets down to details about the hardware cache layout, which aren't completely clear from core-isa.h. But my read is that MTL has a 48k dcache laid out in three ways with a 16k stride. So if you have two cache lines pinned in the dcache at the same address modulo 16k, you can't add a third. I see the code here adding two essentially unrestricted dcache addresses (the literals and the stack). Is it possible there's another somewhere else, maybe leftover from the ROM? If so, then bad luck with memory layout would (I think) be able to make this happen. [1] Presumably "two" so that there's always one available to populate via normal memory operation |
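A quick sketch of that arithmetic using the XCHAL geometry macros from core-isa.h; on the MTL reading above this makes the way size 16k, i.e. the low 14 bits of the address:

```c
#include <stdbool.h>
#include <stdint.h>
#include <xtensa/config/core-isa.h>

/* With a 48k dcache in three ways, each way spans 16k, and two addresses
 * contend for the same set exactly when they match modulo the way size. */
#define DCACHE_WAY_SIZE	(XCHAL_DCACHE_SIZE / XCHAL_DCACHE_WAYS)

static bool same_dcache_set(uintptr_t a, uintptr_t b)
{
	/* Compare set indexes: address modulo way size, at line granularity. */
	return (a % DCACHE_WAY_SIZE) / XCHAL_DCACHE_LINESIZE ==
	       (b % DCACHE_WAY_SIZE) / XCHAL_DCACHE_LINESIZE;
}
```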
(Deleted a comment again to avoid confusion: thought I had it, but missed a spot where it's loading the mask values.) FWIW, regarding the earlier point: it wouldn't be too hard in the failing configuration to dump the actual addresses and see if the low 14 bits of the mask and literals regions match (there are two lines in literals). And if that does turn out to be the problem, you could resolve it by reserving space in that literals area and copying the mask words there. That way they live sequentially in memory and can't collide on cache index. |
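A rough C approximation of that suggestion, with hypothetical names; a static buffer only approximates the idea, since the real proposal was to reserve the space inside the literals region itself so adjacency (and hence distinct set indexes) is guaranteed:

```c
#include <stdint.h>
#include <string.h>
#include <xtensa/config/core-isa.h>

/* One aligned cache line to stage the mask words next to the other locked
 * data, so the mask and the literals occupy consecutive sets instead of
 * contending for the same index. */
static uint32_t mask_staging[XCHAL_DCACHE_LINESIZE / sizeof(uint32_t)]
	__attribute__((aligned(XCHAL_DCACHE_LINESIZE)));

static const uint32_t *stage_hpsram_mask(const uint32_t *mask, size_t words)
{
	memcpy(mask_staging, mask, words * sizeof(uint32_t));
	return mask_staging;	/* pass this to power_down() instead */
}
```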
IPC4 Daily tests are mostly red right now because of this, which means other, unrelated regressions WILL sneak in unnoticed. tl;dr: IPC4 CI is dead right now. |
If I had to bet, I would put my money here. I was looking at the documentation and stumbled on the same implementation notes that you commented on. That would explain why moving that variable off the stack solves the problem. |
@ceolin @andyross this bug is making me rich. Looks like this idea isn't correct either. We've tried various ways to unlock or even to free all cache lines - no success. Maybe we do have to use zephyrproject-rtos/zephyr#75108 as long as we don't have a better solution |
There's also an incorrectness in the Zephyr code at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power_down.S#L92-L104 - that code expects |
The power_down() function will lock dcache for the hpsram_mask array. On some platforms, the dcache lock will fail if the array is on a cache line that can be used for window register context saves. Work around this by padding the hpsram_mask to a full cacheline. Link: thesofproject/sof#9268 Signed-off-by: Kai Vehmanen <[email protected]>
One more idea. The exception seems to be right after the first reference to windowed registers set up by call8. This is when the window overflow would be handled. The registers are stored to the stack frame, so these writes may go to the same cache line as we are trying to lock with dpfl (on the caller's stack) -- or some other interaction between window overflow and dpfl. An ugly change following this hypothesis seems to be holding up -> zephyrproject-rtos/zephyr#75285 -- let's see full results. |
Update Zephyr baseline to 650227d8c47f.

Changes affecting SOF build targets:

32d05d360b93 intel_adsp/ace: power: fix firmware panic on MTL
a3835041bd36 intel_adsp/ace: power: Use MMU reinit API on core context restore
a983a5e399fd dts: xtensa: intel: Remove non-existent power domains from ACE30 PTL DTS
a2eada74c663 dts: xtensa: intel: Remove ALH nodes from ACE 3.0 PTL DTS
442e697a8ff7 dts: xtensa: intel: Reorder power domains by bit position in ACE30
d1b5d7092e5a intel_adsp: ace30: Correct power control register bitfield definitions
31c96cf3957b xtensa: check stack boundaries during backtrace
5b84bb4f4a55 xtensa: check stack frame pointer before dumping registers
cb9f8b1019f1 xtensa: separate FATAL EXCEPTION printout into two
e9c23274afa2 Revert "soc: intel_adsp: only implement FW_STATUS boot protocol for cavs"
1198c7ec295b Drivers: DAI: Intel: Move ACE DMIC start reset clear to earlier
78920e839e71 Drivers: DAI: Intel: Reduce traces dai_dmic_start()
9db580357bc6 Drivers: DAI: Intel: Remove trace from dai_dmic_update_bits()
f91700e62968 linker: nxp: adsp: add orphan linker section

Link: thesofproject#9268
Link: thesofproject#9243
Link: thesofproject#9205
Signed-off-by: Kai Vehmanen <[email protected]>
The power_down() function will lock dcache for the hpsram_mask array. On some platforms, the dcache lock will fail if the array is on cache line that can be used for window register context saves. Work around this by aligning and padding the hpsram_mask to cacheline size. Link: thesofproject/sof#9268 Signed-off-by: Kai Vehmanen <[email protected]>
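The shape of that workaround in C, with the segment count as a placeholder (the real array length is platform-specific): the union pads the mask out to a full line, and the alignment attribute keeps any other data off the line that power_down() locks with DPFL:

```c
#include <stdint.h>
#include <xtensa/config/core-isa.h>

#define HPSRAM_SEGMENTS 2	/* hypothetical; actual count is per-platform */

/* Aligned to a dcache line and padded to fill it, so no window-overflow
 * spill (or any other store) can share the locked line. */
union {
	uint32_t mask[HPSRAM_SEGMENTS];
	uint8_t pad[XCHAL_DCACHE_LINESIZE];
} hpsram_mask __attribute__((aligned(XCHAL_DCACHE_LINESIZE)));
```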
Did you say low maintenance?
Sorry if I'm jumping to conclusions; I couldn't resist :-) |
thesofproject#9268 seems to be back, Zephyr PR 78057 is a new attempt to fix it. Signed-off-by: Guennadi Liakhovetski <[email protected]>
And a new attempt to fix it: zephyrproject-rtos/zephyr#78057 |
Posted this in the PR by accident, copying here: FWIW: I still bet that if you whiteboxed a rig to enumerate all the dcache indexes and check how many of them can be pinned, we'd discover that something in the boot ROM or elsewhere has left a line pinned accidentally, preventing more pinning at that index, and we're just hitting that by bad luck in the linker. Would be non-trivial assembly to write, and tedious to debug as the only feedback is a panic, but not "hard" hopefully. |
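A sketch of what that rig might look like, assuming it runs with the dcache enabled over a dedicated write-back-cached probe buffer (all names hypothetical). Per the earlier footnote, only ways-1 lines per index can legally be pinned, so a leftover pinned line would make the final DPFL at its index trap -- the panic is the feedback:

```c
#include <stdint.h>
#include <xtensa/config/core-isa.h>

#define DCACHE_WAY_SIZE	(XCHAL_DCACHE_SIZE / XCHAL_DCACHE_WAYS)

/* Probe buffer covering every set in every way of the dcache. */
static uint8_t probe[XCHAL_DCACHE_SIZE]
	__attribute__((aligned(DCACHE_WAY_SIZE)));

static void probe_lockable_lines(void)
{
	for (uint32_t set = 0; set < DCACHE_WAY_SIZE;
	     set += XCHAL_DCACHE_LINESIZE) {
		/* Try to pin ways-1 lines at this index; if something is
		 * already pinned here, the last DPFL should panic. */
		for (int way = 0; way < XCHAL_DCACHE_WAYS - 1; way++) {
			uint8_t *p = &probe[way * DCACHE_WAY_SIZE + set];

			__asm__ volatile("dpfl %0, 0" :: "a"(p) : "memory");
		}
		/* Unlock again so the next index starts clean. */
		for (int way = 0; way < XCHAL_DCACHE_WAYS - 1; way++) {
			uint8_t *p = &probe[way * DCACHE_WAY_SIZE + set];

			__asm__ volatile("dhu %0, 0" :: "a"(p) : "memory");
		}
	}
}
```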
@andyross I think I tested that back when debugging this the last time, by doing either one or both of the following things:
Neither had helped |
@lyakh my read is there are three ways, so you need to prove you can lock two lines at each index to saturate. IIRC there are two separate dcache regions in the code in question, right? So if those overlap in index and happen to collide with something already in the cache, then we'd be in a situation where we're trying to lock the same index in all three ways, which is illegal. Maybe. :) |
@andyross sure, but as I said - I also tried unlocking all cache lines and it didn't help either |
@lyakh is everything working now? Can we close this issue? |
@wszypelt the last words were "... and it didn't help either"? |
@wszypelt yes, so far zephyrproject-rtos/zephyr#78283 has fixed it |
The power_down() function will lock dcache for the hpsram_mask array. On some platforms, the dcache lock will fail if the array is on cache line that can be used for window register context saves. Work around this by aligning and padding the hpsram_mask to cacheline size. Link: thesofproject/sof#9268 Signed-off-by: Kai Vehmanen <[email protected]>
Describe the bug
A DSP panic started showing up in CI runs on 27th June. No individually merged PR shows this in its test results, so the suspicion is that a combination of merged PRs is responsible. Current suspects:
Both passed tests independently, but together there seem to be DSP panics.
To Reproduce
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html
Reproduction Rate
Very high (most PR CI runs on 27th June)
Expected behavior
No DSP panics
Impact
Blocking CI
Environment
https://sof-ci.01.org/sofpr/PR9265/build6047/devicetest/index.html
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html