Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kpb: Fix draining task and LL conflict #8936

Merged
merged 1 commit into from
Mar 22, 2024

Conversation

serhiy-katsyuba-intel
Copy link
Contributor

KPB draining task is executed by low priority preemptible EDF thread. The task accesses KPB sink buffer and calls comp_copy() for component connected to sink. If LL thread preempts draining task that could result in broken state of sink buffer or component connected to sink. This fix prevents LL from preempting draining task in bad moment.

This fixes CI failures on test_003_01_kd_topology_host_gw[4ch-24in32bit-].

* Hence k_sched_lock() is used temporary to block LL from preempting EDF task thread.
*/
k_sched_lock();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this break the concept of LL? The draining part could take quite a while for a large buffer, and even with this cycle of lock/unlock the entire premise of LL executing within 1ms is broken.

That's a priority inversion in my book where a low-priority task prevents a high-priority/low-latency one from being executed...

Why can't we fix the broken state of the sink instead ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On each loop iteration draining task basically just calls comp_copy() on component connected to KPB sink. So each loop iteration should take much less then 1 ms.
If timer interrupt increased semaphore on which LL thread waits, immediately at k_yield() scheduler will switch to executing LL thread since LL thread has higher priority then EDF task thread. Or am I overlooking something?

Looks like original draining task code was written in pre-Zephyr era. And probably the code works or worked without Zephyr just fine. However, the code seems were never properly ported to Zephyr. Original draining task implementation has multiple problems:

  • there is no sleep/yield in loop iteration and so task blocks IPC thread for a long time (several seconds in our test) and waists power (CPU cannot reach idle between loop iterations).
  • LL can preempt draining task and call comp_copy() for same component or access/modify same buffer/stream.

This fix is an attempt to fix CI failure, not to do proper reimplementation of draining task. Without the fix we have random failures on KPB test. We have to remove test temporary just not to block all the work on SOF.

The test fails with draining task runs forever because KPB sink audio_stream has broken state: avail == free == 0. Such state normally cannot happen as those values are always calculated as avail = size - free and free = size - avail. avail and free simultaneously equal to 0 can only happen when LL preempts draining task in unlucky moment.

The fix passes the test. It has no intent to re-implement draining task properly: I'm not sure if KPB is used in any production case other then demo. The intent is to fix CI and return back KPB test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like you said no one is using KPB in any firmware based on Zephyr (post 2.2).

I am not getting the idea of 'fixing CI' which consists in re-testing a broken solution based on a workaround impacting other usages.

Is KPB is so badly broken, it needs to be disabled and redone, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's value in keeping kpb tests. This exercizes a certain type of flow (running an EDF task) in CI that we don't have many others in CI and if we remove the test, this code will bitrot very quickly. The workaround here is local to kpb, so I see no harm outside kpb.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i "The workaround here is local to kpb" -> not really, it would impact any other compound use case with e.g. speaker protection running.

If KPB is a test component, them let's mark it as non-default and opt-in for CI purposes. This shouldn't go in any releases IMHO.

KPB draining task is executed by low priority preemptible EDF thread.
The task accesses KPB sink buffer and calls comp_copy() for component
connected to sink. If LL thread preempts draining task that could result
in broken state of sink buffer or component connected to sink. This fix
prevents LL from preempting draining task in bad moment.

Signed-off-by: Serhiy Katsyuba <[email protected]>
Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this PR as 'Request Changes' so that my comment on the conceptual issue is addressed. Is this really what we want to do?

This looks like a slippery slope towards an 'almost low latency' and 'your mileage may vary' real-time scheduling.

@lgirdwood @kv2019i FYI.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand @plbossart your concern, but I also see value in keeping this in CI. See my rationale inline. We do need to do something as this is currently slowing down e.g. backporting critical fixes for SOF2.9.

* Hence k_sched_lock() is used temporary to block LL from preempting EDF task thread.
*/
k_sched_lock();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's value in keeping kpb tests. This exercizes a certain type of flow (running an EDF task) in CI that we don't have many others in CI and if we remove the test, this code will bitrot very quickly. The workaround here is local to kpb, so I see no harm outside kpb.

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 15, 2024

@mwasko @wszypelt Can you ack the firmware CI aspect (see Pierre's comment above).

@mwasko
Copy link
Contributor

mwasko commented Mar 20, 2024

@mwasko @wszypelt Can you ack the firmware CI aspect (see Pierre's comment above).

@kv2019i the KPB tests should remain in CI as it is one of the features that is required for Windows.

@serhiy-katsyuba-intel , @plbossart this change should be treated as workaround, because by design KPB should be interrupted by LL. My understanding is that for the sake of CI as a short term solution we will prevent LL from preempting draining task, but the long term goal should be to fix KPB and draining implementation (adapt to DP).

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 22, 2024

Filed #8977 to follow-up, merging this now.

@kv2019i kv2019i merged commit 6f33aee into thesofproject:main Mar 22, 2024
40 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants