Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Introduce arch_send_cpu_stop to halt secondary cores for fatal errors in SMP systems #65143

Closed
wants to merge 2 commits into from

Conversation

quic-lingutla
Copy link
Contributor

On SMP systems, when fatal error occured, there is no common mechanism available to halt secondary cores.
Introduce arch_send_cpu_stop API to halt secondary cores, which gets called in fatal error path (for now arch_system_halt() calls it- can be changed).

As an example, implemented the API for RISC-V architecture, but each ARCH should implement its own mechanism to inform/halt secondary cores.

@quic-lingutla quic-lingutla marked this pull request as ready for review November 14, 2023 16:48
@quic-lingutla quic-lingutla marked this pull request as draft November 14, 2023 16:48
@zephyrbot zephyrbot added area: Kernel area: RISCV RISCV Architecture (32-bit & 64-bit) area: Base OS Base OS Library (lib/os) labels Nov 14, 2023
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes. Not opposed to a general API for this, but I'm not sure this is it either.

The reason this has been left undefined (i.e. "owned by the SOC layer") at the arch and kernel levels is that different architectures have very different ideas about how it works. Some systems can't even stop CPUs individually, some like intel_adsp have to shut them down synchronously (maybe, this is IMHO sort of a wart, but that's how it works right now) in coordination with an external host CPU, etc...

One thing this does skip though, that's really important, is a state predicate. You can tell a CPU to shut down with this API, but what you can't do is know if it's "definitely off" such that it's safe to turn back on without race conditions. I suspect that's because you're looking at this as a fatal error handler, but the problem is bigger than that.

riscv_send_ipi(IPI_SCHED);
}

void arch_send_cpu_stop(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I'd call this mechanism something like arch_cpu_stop_async() if you want to discriminate from the "waits for other CPU to halt" variant.

{
printk("Stopping CPU: %d\n", _current_cpu->id);
while (1)
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: maybe "arch_cpu_stop_current()" to make it clear how this differs from the other variants.

Obviously this is a stub, but even so: note that this not only doesn't "stop" the CPU, it doesn't actually do anything on its own to prevent the CPU from continuing to work. All it does is cause an infinite loop in whatever thread context called it. We can still schedule other threads and handle interrupts. Put an arch_irq_lock() before it if you want to stub it this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
But this is called with interrupts disabled right ? (from interrupt handler)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use arch_cpu_idle here in the while loop for power saving

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion, added.

kernel/fatal.c Outdated
@@ -19,6 +19,12 @@

LOG_MODULE_DECLARE(os, CONFIG_KERNEL_LOG_LEVEL);

#ifdef CONFIG_SMP
__weak void arch_send_cpu_stop(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, suggest leaving this undefined and not weak. If an app/subsystem requires this API, and the arch layer doesn't provide it, we want the failure to be build-time and not runtime.

In circumstances where we want a fancier mechanism, consider something like "CONFIG_ARCH_HAS_xxx".

arch/riscv/core/smp.c Outdated Show resolved Hide resolved
arch/riscv/core/smp.c Outdated Show resolved Hide resolved
arch/riscv/core/smp.c Outdated Show resolved Hide resolved
On SMP systems, if panic/fatal errors triggered, there is no common
mechanism available to halt other cpus.

Introduce arch_cpu_stop_async API, which gets called by fatal error
handling cpu to halt other cpus in system.

The fatal handling cpu doesn't wait for other cpus to halt.

Signed-off-by: Lingutla Chandrasekhar <[email protected]>
@quic-lingutla quic-lingutla force-pushed the ipi_stop branch 3 times, most recently from 461de69 to 75086d9 Compare November 17, 2023 22:10
@quic-lingutla quic-lingutla marked this pull request as ready for review November 17, 2023 23:22
On fatal errors, fatal handling cpu calls arch_cpu_stop_async to halt
other cpus in system.

Add IPI_CPU_STOP to implement ARCH_HAS_IPI_CPU_STOP support for RISCV
architecture.

Signed-off-by: Lingutla Chandrasekhar <[email protected]>
@andyross
Copy link
Contributor

Still doesn't have state feedback to know a CPU is off (i.e. that it's safe to call arch_start_cpu() again). As mentioned, without that, this isn't really useful for anything but fatal system shutdown. And given that, I'm not sure it has much value as a general kernel API?

Maybe we should continue to provide that as a private API out of the platform layers and call it from an app-provided sys_fatal_error_handler instead? Again, CPU lifecycle behavior is really varied between platforms, I don't think many of us have given much thought to a general solution here, and this is really just a shim for the needs of one app on one device?

@andyross
Copy link
Contributor

@dcpleung and @ceolin might have input, as the intel_adsp core lifecycle is likewise complicated and special case.

@dcpleung
Copy link
Member

TBH, the name arch_send_cpu_stop is kind of misleading here. The actual implementation is simply to send IPI. So maybe rename it to arch_ipi_cpu_stop or something... or better, just have a generic arch_ipi() with some generic value and architecture specific add-ons.

@quic-lingutla
Copy link
Contributor Author

Still doesn't have state feedback to know a CPU is off (i.e. that it's safe to call arch_start_cpu() again). As mentioned, without that, this isn't really useful for anything but fatal system shutdown. And given that, I'm not sure it has much value as a general kernel API?

Ah. my bad. you are right, i coded the API for fatal errors. I will try to have state feedback to turn on again.

Maybe we should continue to provide that as a private API out of the platform layers and call it from an app-provided sys_fatal_error_handler instead? Again, CPU lifecycle behavior is really varied between platforms, I don't think many of us have given much thought to a general solution here, and this is really just a shim for the needs of one app on one device?

As the IPI can be common across targets, so i kept in kernel API.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Jan 31, 2024
@github-actions github-actions bot closed this Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Base OS Base OS Library (lib/os) area: Kernel area: RISCV RISCV Architecture (32-bit & 64-bit) Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants