kernel: spinlock: Ticket spinlocks #61541
(force-pushed: 8bdf6cb → 2b8d641)
Very clever. Am I correct in understanding that in the ticket spinlock model, the …
Hi @peter-mitsis , Indeed. Currently the "locked" field is something of a compromise to pass the tests: in tests/kernel/spinlock we address this field directly (which is a bad idea, imho) to verify the spinlock state. P.S.: We probably shouldn't access the spinlock internals directly in tests/kernel/spinlock and should instead introduce a k_spin_is_locked() API. But I don't have a firm opinion here, since such an API would only be used for the tests. So I just left it this way.
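A minimal sketch of what such a test-only accessor could look like — hypothetical, not an existing Zephyr API; the "locked" field follows the basic single-atomic-variable spinlock discussed above:

```c
#include <stdbool.h>
#include <zephyr/spinlock.h>
#include <zephyr/sys/atomic.h>

/* Hypothetical test-only helper: reports whether the basic spinlock
 * is currently held. Inherently racy, so it is only meaningful as a
 * test assertion, never as a synchronization decision. */
static inline bool k_spin_is_locked(struct k_spinlock *l)
{
	return atomic_get(&l->locked) != 0;
}
```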
(force-pushed: 2b8d641 → 0b3a00d)
Hi All, Currently we have a test failure due to the following spinlock API extension in the module, which is external to Zephyr (modules/audio/sof/zephyr/include/rtos/spinlock.h):
What's our policy about it? Should we bring this API to Zephyr, or should we extend it in SOF?
(force-pushed: 0b3a00d → ff52b27)
First off, please check the correctness. This doesn't look right to me, but lockless algorithms are REALLY hard, so it's possible this will just need a round or seven of explanation.
Designwise: No complaints about having this in the stable of synchronization primitives, but absolutely not as the default for everyone. A spinlock is the minimum-overhead synchronization choice for routine mostly-uncontended critical sections. The kernel users don't want fairness, they want performance about 99% of the time.
Can you refactor as a k_fair_spinlock or whatnot?
include/zephyr/spinlock.h (outdated):

    atomic_t next_ticket;
    atomic_val_t next_ticket_val;
    ...
    atomic_set(&next_ticket, ticket_val);
This looks really suspicious to me. You're getting the value of "owner" and then writing it unconditionally to "next_ticket" with no synchronization. The following atomic_inc() is clearly not atomic, because the write is unconditional.
That is: what happens if you have two threads in exactly this code racing against each other? You can draw paths through the execution where next_ticket is incremented either once or twice.
Hi @andyross,
Here we just estimate what the l->owner and l->tail of a free spinlock would look like for the particular l->owner. We don't really change the spinlock state until the CAS; we keep the estimated values on the stack.
The atomic operations before the atomic_cas are done only for the sake of consistency of the values. We can't just increment ticket_val, since that's not how we change the real l->owner and l->tail in k_spin_lock and k_spin_unlock:
the estimation should follow the same steps as the real algorithm, where we use atomic operations on these fields.
The contending threads indeed may have the same ticket_val and next_ticket_val, but only the fastest thread will acquire the spinlock with
if (!atomic_cas(&l->tail, ticket_val, next_ticket_val)) {
goto busy;
}
The second thread will not meet the CAS condition and will return -EBUSY.
Moreover, we are not entirely fair here: if the spinlock is acquired AFTER we "freeze" the ticket_val with atomic_get(&l->owner) and released BEFORE we reach the atomic_cas, we will still identify the spinlock as locked. So sometimes k_spin_trylock will fail with ticket spinlocks under conditions where the original spinlock would succeed. This is a side effect of having 2 atomic variables instead of 1 for ticket spinlocks.
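To make the estimation idea concrete, here is a simplified sketch of the trylock path described above (the struct and function names are illustrative, and the plain `+ 1` stands in for the PR's atomic-op sequence on the stack copies; this is not the PR's exact code):

```c
#include <errno.h>
#include <zephyr/sys/atomic.h>

struct ticket_lock {
	atomic_t owner;  /* ticket currently holding the lock */
	atomic_t tail;   /* next ticket to be handed out */
};

static inline int ticket_trylock(struct ticket_lock *l)
{
	/* "Freeze" an estimate of what a *free* lock looks like:
	 * when nobody holds the lock, owner == tail. */
	atomic_val_t ticket_val = atomic_get(&l->owner);
	atomic_val_t next_ticket_val = ticket_val + 1;

	/* The CAS succeeds only if tail still equals our estimate,
	 * i.e. the lock is free and no other thread raced ahead. */
	if (!atomic_cas(&l->tail, ticket_val, next_ticket_val)) {
		return -EBUSY;
	}

	/* Success: our ticket equals owner, so we now hold the lock. */
	return 0;
}
```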
FWIW: that SOF thing looks like a mistake. Spinlocks don't have an initialization API; they're intended to be zero-filled at initialization so they can be used out of .bss before initialization hooks can be called, etc. But if SOF really wants that, we can provide a K_SPINLOCK_INIT token or something that just expands to a zero-filled struct.
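Such a token could be as trivial as the following sketch (hypothetical, along the lines suggested above; not an existing Zephyr macro):

```c
/* Hypothetical initializer token: just a zero-filled struct, matching
 * the zero-initialization the spinlock code already relies on. */
#define K_SPINLOCK_INIT { }

static struct k_spinlock my_lock = K_SPINLOCK_INIT;
```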
Hi @andyross,
P.S.: this concept is not my invention; Linux went down this path a long time ago: https://lwn.net/Articles/267968/
Thanks! Then let's make the corresponding change in SOF. I will post a link here once I open a review there.
Yeah, for sure let's merge this as a separate API while it stabilizes. Adding a bunch of macros to do the switch at some point in the future is trivial, but even then I think that most apps that hit these concerns probably do want control over which locks get the treatment (because, even in Linux, the overwhelming majority of locks aren't contended).

Can you submit a bug on that live lock situation you hit? I'm betting it's going to root cause to something closer to a scheduler mistake, or be something that a tweaked arch_spin_relax() can treat more simply. And even if that's not possible, I think I'd be a lot more OK with a patch "replace the scheduler lock with a ticket spinlock" (especially if it came with a bug report and measurements!) than "all spinlocks are now ticket locks".

Basically, we don't run in contention environments anything like what Linux deals with. Zephyr platforms tend to care a lot more about code size and simplicity[1] than about this kind of optimization, and we don't have the depth of community expertise to draw on to maintain this stuff. The barrier to "let's put a novel lockless algorithm[2] right into the middle of the most important primitive in the kernel" is a ton higher for us than for Linux.

I don't have time to puzzle through the explanation about the trylock code, will get to that soon. (Though my first thought on scanning your text is "Oh, it's a guess at what we'll get" -- so maybe call the variable in question something like "guess"?).

[1] e.g. Think of all the CI overhead it would be testing both of these spinlock variants to the same level of coverage!

[2] Note that we just got support for standard memory barrier primitives about two months ago, and I'm willing to bet that the atomics layer is still going to have warts on some platforms (arm64 is the poster child for surprising out-of-order semantics, check there for sure) that would break this. I realize @carlocaione isn't on the review list and should be; barriers are his thing.
(force-pushed: ff52b27 → 089bf25)
(force-pushed: 089bf25 → a8caf4a)
Hi @andyross , @carlocaione , What has been done so far:
@andyross , I have some doubts when I think about a separate API for this implementation. Here is what troubles me: could we have a configuration option and keep the ticket spinlocks disabled by default? This seems to be safe, though perhaps I'm missing something. What do you say? P.S.: the latest build failure seems not to be related to my changes (error in some device tree?)
It may depend on environment. On my box, when unloaded, it's 90% reliable or so. Most likely the problem is that host scheduling is failing to run the two threads simultaneously, and obviously a contention test only works if both CPUs are actually running. Having one halt magically because of the host scheduler is going to look very unfair. This may be hard to tune (e.g.: wait for some kind of "the other side is running" detection, then run for just a short time ("significantly less than CONFIG_HZ on the host kernel"), then repeat. Something like that.
Right. More bluntly: none of my complaints rise to a level where I'd refuse it based on the code alone[1]. Prove to me that this runs on all our SMP platforms and doesn't break anything and I'll shut up and +1. [1] The code itself looks fine, though I do think it wants some attention if we're going to be serious about this long term: rollover protection for sure, probably also 16-bit counters à la Linux (or 15, and roll the locked bit into the same word), etc.
(force-pushed: 85a1365 → ccb4a12)
The basic spinlock implementation is based on a single atomic variable and doesn't guarantee locking fairness across multiple CPUs. It's even possible that a single CPU will win the contention every time, which will result in a live-lock. Ticket spinlocks provide a FIFO order of lock acquisition, which resolves this unfairness issue at the cost of a slightly increased memory footprint. Signed-off-by: Alexander Razinkov <[email protected]>
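As a rough sketch of the scheme this commit describes (field and function names are illustrative, reusing the two-counter struct from the trylock sketch above; not the exact PR diff): each CPU atomically takes a ticket from the tail counter and spins until the owner counter reaches its ticket; unlock advances owner, handing the lock over in FIFO order.

```c
#include <zephyr/sys/atomic.h>

struct ticket_lock {
	atomic_t owner;  /* ticket number currently allowed in */
	atomic_t tail;   /* next ticket number to hand out */
};

static inline void ticket_lock(struct ticket_lock *l)
{
	/* atomic_inc() returns the previous value: our ticket */
	atomic_val_t my_ticket = atomic_inc(&l->tail);

	/* Wait for our turn, in strict FIFO order */
	while (atomic_get(&l->owner) != my_ticket) {
		/* spin; a real implementation would relax here */
	}
}

static inline void ticket_unlock(struct ticket_lock *l)
{
	/* Pass the lock to the next ticket holder */
	(void)atomic_inc(&l->owner);
}
```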
Added a test to verify spinlock acquisition fairness with respect to the CPUs contending for the spinlock. This test is only enabled for ticket spinlocks, which are required to provide this kind of fairness. Signed-off-by: Alexander Razinkov <[email protected]>
Added a test to verify that the value of an atomic_t will be the same in case of overflow whether it is incremented in an atomic or non-atomic manner. Signed-off-by: Alexander Razinkov <[email protected]>
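A minimal sketch of what such a check might look like (illustrative ztest code, not the test's actual source):

```c
#include <zephyr/sys/atomic.h>
#include <zephyr/ztest.h>

ZTEST(atomic_overflow, test_inc_wraps_like_plain_int)
{
	/* Start both counters at the all-bits-set value, one step
	 * before the counter wraps around */
	atomic_t at = ATOMIC_INIT((atomic_val_t)-1);
	atomic_val_t plain = (atomic_val_t)-1;

	atomic_inc(&at);  /* atomic increment: wraps to 0 */
	plain++;          /* non-atomic increment: must wrap identically */

	zassert_equal(atomic_get(&at), plain,
		      "atomic and non-atomic overflow must match");
}
```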
(force-pushed: ccb4a12 → c5d6d42)
Hi @andyross , @cfriedt , @nashif , @evgeniy-paltsev , @npitre , @carlescufi , @carlocaione , as discussed at our Architecture WG meeting, I've enabled CONFIG_TICKET_SPINLOCKS by default for the qemu_riscv64_smp platform. I've also prolonged the busy loop imitating some work within the test to help the threads synchronize, which is critical for QEMU, where a contending thread could be scheduled out at any moment, resulting in artificial unfairness. This is not a 100% robust way to avoid the artificial unfairness, but let's see how it goes. P.S.: QEMU has a more advanced and precise emulation mode, CONFIG_QEMU_ICOUNT, but it is disabled for qemu_riscv64_smp, and when I try to enable it the platform build fails.
My memory from the last time we went through this was that icount provides deterministic emulation for only one core; in SMP the two CPUs still run in separate host threads, so it wouldn't address the problem.
FYR, just put some test reports here in case it might help. I tested it on a simulation platform fvp_baser_aemv8r_smp (Arm64 SMP 4 cores) which is a 100% deterministic platform, and all cores are simulated in only one host thread (at least one host core). With
For comparison,
However, it's worth noting that FVP simulates all cores on a single host core, which means the host core might simulate e.g. ~1000 instructions per core sequentially (for example, with core0 through core3, the host core simulates 1000 instructions for core0 first, then for core1, and so on). That's why it's so "unfair" when …
Jaxson Han wrote:
real hardware executes instructions simultaneously
While this is true, different cores can't access the same memory
simultaneously. And some memory might be cached on one core while access
to that memory is put behind a transaction queue on another core putting
it at a disadvantage. This is even systematic on NUMA architectures.
Therefore your test results are pretty representative of what might
happen on real hardware.
Spent some time today with twister and the six[1] SMP emulation platforms. The upshot is that the spinlock test is much improved, but still failing significantly more than other tests[2] in a loaded CI run. Roughly half the failures in any run are this test:
- A full twister suite with default kconfig saw 10 failures
- Two full suite runs with tickets set to =y globally saw 19 the first time and 10 the second
Breaking out the spinlock test specifically and running it on an unloaded system, though, looks much better: one failure in 60 twister iterations (across 15 test binaries each, so 1 in 900). Given that the test is known to be able to fail stochastically in emulation, that seems actually OK to me.
Basically it looks good to me from an emulation perspective. I'm going to put a +1 on it, as it seems extremely unlikely to break anything at the integration level.
I would still love to see someone:
- Get some results from a full suite on a real SMP hardware platform
- Report some benchmarking numbers (not just the fairness results) on real hardware
[1] twister -p qemu_cortex_a53_smp -p qemu_riscv32_smp -p qemu_riscv64_smp -p qemu_x86_64 -p qemu_x86_64_nokpti, which as of this PR runs 1874 individual test binaries.
[2] Though an interesting second place I hadn't noticed was tests/subsys/zbus/integration. This dies once in almost every run, and twice died successively and required a third --only-failed run. Probably worth checking out, usually this means a test with needlessly tight timing constraints.
IIRC @cocoeoli and @lenghonglin are doing some work related to real hardware SMP (Raspberry Pi 4 and ROC-RK3568-PC?); no idea if you've had a chance to test?
Busy these days. I will try to complete SMP on Raspberry Pi 4B this weekend.
Appreciate it, and really no rush, just take your time.
I'll run (well, actually re-run) the test suite on ARC HW SMP boards (HSDK and HSDK4xD); both have 4 CPU cores.
I've played a bit with the fairness test on real SMP HW. In the worst case I got these results (with the regular Zephyr spinlock implementation): HSDK (4 cores)
HSDK (2 cores)
HSDK4xD (4 cores)
That's expected: if a core already has the atomic variable in its local cache (and in owned state), then this core will perform the atomic operation faster than other cores. The test passes on all SMP HW when I enable ticket spinlocks. I've also run the full test suite on all ARC SMP platforms with ticket spinlocks enabled, and I don't see any additional failures.
Basic spinlock implementation is based on single atomic variable and doesn't guarantee locking fairness across multiple CPUs. It's even possible that single CPU will win the contention every time which will result in a live-lock. Ticket spinlocks provide a FIFO order of lock acquisition which resolves such unfairness issue at the cost of slightly increased memory footprint.
In a future update it would be great to reword so that the help starts with what a ticket spinlock is, rather than what it is not. I.e. the last sentence (with minor tweaks, probably) should be first.
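For example (one possible rewording, not the final text): "Ticket spinlocks provide a FIFO order of lock acquisition, which resolves the unfairness issue at the cost of a slightly increased memory footprint. The basic spinlock implementation is based on a single atomic variable and doesn't guarantee locking fairness across multiple CPUs: a single CPU may even win the contention every time, resulting in a live-lock."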