RFC: Introduce arch_send_cpu_stop to halt secondary cores for fatal errors in SMP systems #65143
Some notes. Not opposed to a general API for this, but I'm not sure this is it either.
The reason this has been left undefined (i.e. "owned by the SOC layer") at the arch and kernel levels is that different architectures have very different ideas about how it works. Some systems can't even stop CPUs individually, some like intel_adsp have to shut them down synchronously (maybe, this is IMHO sort of a wart, but that's how it works right now) in coordination with an external host CPU, etc...
One thing this does skip though, that's really important, is a state predicate. You can tell a CPU to shut down with this API, but what you can't do is know if it's "definitely off" such that it's safe to turn back on without race conditions. I suspect that's because you're looking at this as a fatal error handler, but the problem is bigger than that.
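To make the "state predicate" point concrete, here is a rough host-side model in C11 of what stop-with-feedback could look like. None of this is Zephyr code: `cpu_stop_request()`, `cpu_stop_ack()`, `cpu_is_stopped()`, `cpu_mark_restarted()` and `NUM_CPUS` are hypothetical names, and the IPI and the actual park or power-down are only comments. It is a sketch of the shape of the problem, not a proposal for the final API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CPUS 4 /* hypothetical; stands in for the platform's CPU count */

enum cpu_state { CPU_RUNNING, CPU_STOP_REQUESTED, CPU_STOPPED };

/* Static storage, so every entry starts as CPU_RUNNING (0). */
static _Atomic int cpu_states[NUM_CPUS];

/* Requesting side: succeeds only if the target was running. */
bool cpu_stop_request(int id)
{
	int expected = CPU_RUNNING;

	assert(id >= 0 && id < NUM_CPUS);
	/* A real implementation would send the stop IPI after this. */
	return atomic_compare_exchange_strong(&cpu_states[id], &expected,
					      CPU_STOP_REQUESTED);
}

/* Target side: the last thing the CPU does before it parks. */
void cpu_stop_ack(int id)
{
	atomic_store(&cpu_states[id], CPU_STOPPED);
	/* Real hardware may still need a platform check that the core
	 * is actually halted or powered down after this store.
	 */
}

/* The predicate: is it safe to start this CPU again? */
bool cpu_is_stopped(int id)
{
	return atomic_load(&cpu_states[id]) == CPU_STOPPED;
}

/* Restart path: only legal from CPU_STOPPED, so a restart cannot
 * race with a stop that has been requested but not yet acknowledged.
 */
bool cpu_mark_restarted(int id)
{
	int expected = CPU_STOPPED;

	return atomic_compare_exchange_strong(&cpu_states[id], &expected,
					      CPU_RUNNING);
}
```

A caller would poll `cpu_is_stopped()` before calling `arch_start_cpu()` again. Whether "acknowledged" really means "off" is exactly the platform-specific part.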
arch/riscv/core/smp.c
Outdated
	riscv_send_ipi(IPI_SCHED);
}

void arch_send_cpu_stop(void)
FWIW, I'd call this mechanism something like arch_cpu_stop_async() if you want to discriminate from the "waits for other CPU to halt" variant.
arch/riscv/core/smp.c
Outdated
{
	printk("Stopping CPU: %d\n", _current_cpu->id);
	while (1)
		;
Naming: maybe "arch_cpu_stop_current()" to make it clear how this differs from the other variants.
Obviously this is a stub, but even so: note that this not only doesn't "stop" the CPU, it doesn't actually do anything on its own to prevent the CPU from continuing to work. All it does is cause an infinite loop in whatever thread context called it. We can still schedule other threads and handle interrupts. Put an arch_irq_lock() before it if you want to stub it this way.
Sure.
But this is called with interrupts disabled, right? (It is called from the interrupt handler.)
Let's use arch_cpu_idle() here in the while loop to save power.
Thanks for the suggestion, added.
kernel/fatal.c
Outdated
@@ -19,6 +19,12 @@

LOG_MODULE_DECLARE(os, CONFIG_KERNEL_LOG_LEVEL);

#ifdef CONFIG_SMP
__weak void arch_send_cpu_stop(void)
FWIW, suggest leaving this undefined and not weak. If an app/subsystem requires this API, and the arch layer doesn't provide it, we want the failure to be build-time and not runtime.
In circumstances where we want a fancier mechanism, consider something like "CONFIG_ARCH_HAS_xxx".
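As a sketch of the difference, assuming a Kconfig symbol along the lines of `CONFIG_ARCH_HAS_IPI_CPU_STOP`: the arch function is declared as an ordinary (non-weak) symbol, and generic code only references it when the arch advertises support. The helper `fatal_stop_other_cpus()`, the `stop_requests` counter and the stand-in arch implementation below are hypothetical and exist only so the example builds off target.

```c
#include <assert.h>
#include <stdbool.h>

/* Normally generated from Kconfig; defined here so the sketch builds alone. */
#define CONFIG_ARCH_HAS_IPI_CPU_STOP 1

/* Plain declaration, no __weak: if generic code references this and no
 * arch layer defines it, the build fails at link time instead of the
 * call silently doing nothing at runtime.
 */
void arch_cpu_stop_async(void);

static int stop_requests; /* hypothetical counter, for illustration only */

/* Stand-in for the arch layer's implementation (would send the IPI). */
void arch_cpu_stop_async(void)
{
	stop_requests++;
}

/* Hypothetical generic caller: reports whether a stop was requested. */
bool fatal_stop_other_cpus(unsigned int num_cpus)
{
	assert(num_cpus >= 1);

	if (num_cpus == 1) {
		return false; /* uniprocessor: nothing to stop */
	}
#if defined(CONFIG_ARCH_HAS_IPI_CPU_STOP)
	arch_cpu_stop_async();
	return true;
#else
	return false; /* arch does not advertise the capability */
#endif
}
```

With the `#define` removed, the call site compiles out and the caller can tell that nothing was stopped. With the `#define` present but no arch definition, the link fails, which is the build-time failure asked for above.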
On SMP systems, when a panic or fatal error is triggered, there is no common mechanism to halt the other CPUs. Introduce the arch_cpu_stop_async API, which the CPU handling the fatal error calls to halt the other CPUs in the system. The fatal-handling CPU doesn't wait for the other CPUs to halt. Signed-off-by: Lingutla Chandrasekhar <[email protected]>
461de69 to 75086d9
On fatal errors, the fatal-handling CPU calls arch_cpu_stop_async to halt the other CPUs in the system. Add IPI_CPU_STOP to implement ARCH_HAS_IPI_CPU_STOP support for the RISC-V architecture. Signed-off-by: Lingutla Chandrasekhar <[email protected]>
75086d9 to c87f28c
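For readers following the IPI_CPU_STOP commit, the general idea of multiplexing a stop request over the same software interrupt as the scheduler IPI can be modelled as below. This is a host-side sketch in C11, not the code from the PR: `send_ipi_to_others()`, `ipi_handler()`, the `parked[]` and `resched_count[]` arrays, the bit values and `NUM_CPUS` are all assumptions made for illustration, and the real handler would mask interrupts and never return instead of setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CPUS 4 /* hypothetical CPU count */

/* One bit per IPI reason; the values are illustrative. */
#define IPI_SCHED    (1u << 0)
#define IPI_CPU_STOP (1u << 1)

static _Atomic unsigned int pending_ipi[NUM_CPUS];
static bool parked[NUM_CPUS];              /* stands in for "CPU halted" */
static unsigned int resched_count[NUM_CPUS];

/* Sender side: post the reason to every CPU except the sender. */
void send_ipi_to_others(int sender, unsigned int kind)
{
	assert(sender >= 0 && sender < NUM_CPUS);

	for (int i = 0; i < NUM_CPUS; i++) {
		if (i != sender) {
			atomic_fetch_or(&pending_ipi[i], kind);
			/* Real hardware: raise the software interrupt
			 * for CPU i here.
			 */
		}
	}
}

/* Receiver side: runs on `cpu` when its software interrupt fires. */
void ipi_handler(int cpu)
{
	/* Fetch and clear all pending reasons in one atomic step. */
	unsigned int pending = atomic_exchange(&pending_ipi[cpu], 0u);

	if (pending & IPI_CPU_STOP) {
		/* Stop wins over everything else. Real code would lock
		 * interrupts and spin or idle forever at this point.
		 */
		parked[cpu] = true;
		return;
	}
	if (pending & IPI_SCHED) {
		resched_count[cpu]++; /* stands in for the reschedule */
	}
}
```

Note that the sender returns as soon as the bits are posted, which matches the "doesn't wait for other CPUs to halt" behaviour in the commit message and is why the call is named `_async`.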
Still doesn't have state feedback to know a CPU is off (i.e. that it's safe to call arch_start_cpu() again). As mentioned, without that, this isn't really useful for anything but fatal system shutdown. And given that, I'm not sure it has much value as a general kernel API? Maybe we should continue to provide that as a private API out of the platform layers and call it from an app-provided sys_fatal_error_handler instead? Again, CPU lifecycle behavior is really varied between platforms, I don't think many of us have given much thought to a general solution here, and this is really just a shim for the needs of one app on one device?
TBH, the name
Ah, my bad. You are right, I coded the API for fatal errors. I will try to add state feedback so that a CPU can be turned on again.
Since the IPI can be common across targets, I kept it in the kernel API.
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.
On SMP systems, when a fatal error occurs, there is no common mechanism to halt the secondary cores.
Introduce the arch_send_cpu_stop API to halt secondary cores. It is called in the fatal error path (for now arch_system_halt() calls it, but that can be changed).
As an example, the API is implemented for the RISC-V architecture, but each architecture should implement its own mechanism to inform and halt secondary cores.