
kernel: spinlock: Ticket spinlocks #61541

Merged
3 commits merged on Nov 4, 2023

Conversation

amr-sc
Contributor

@amr-sc amr-sc commented Aug 16, 2023

The basic spinlock implementation is based on a single atomic variable and doesn't guarantee locking fairness across multiple CPUs. It's even possible that a single CPU will win the contention every time, which will result in a live-lock.

Ticket spinlocks provide a FIFO order of lock acquisition, which resolves this unfairness issue at the cost of a slightly increased memory footprint.
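The ticket idea described above can be sketched with C11 atomics. This is an illustrative sketch only, not Zephyr's actual k_spinlock code; the `ticket_lock` names are invented for this example. Each locker takes a ticket from `tail`; the lock is held by whoever's ticket equals `owner`, and unlock advances `owner`, handing the lock over in strict FIFO order.

```c
#include <stdatomic.h>

/* Illustrative ticket lock (hypothetical names, not Zephyr's struct). */
struct ticket_lock {
	atomic_uint owner; /* ticket currently being served */
	atomic_uint tail;  /* next ticket to hand out */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Atomically take the next ticket; this is the FIFO enqueue. */
	unsigned int my_ticket = atomic_fetch_add(&l->tail, 1);

	/* Spin until our ticket is being served. A real implementation
	 * would use a relax/backoff hint in this loop.
	 */
	while (atomic_load(&l->owner) != my_ticket) {
	}
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Advance to the next ticket, waking the next FIFO waiter. */
	atomic_fetch_add(&l->owner, 1);
}
```

The lock is free exactly when `owner == tail`, which is also why the separate `locked` field discussed below carries no additional information.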

@peter-mitsis
Collaborator

Very clever. Am I correct in understanding that in the ticket spinlock model, the locked field serves no functional purpose as the determination of it being locked/unlocked is solely dependent upon the owner and tail fields?

@amr-sc
Contributor Author

amr-sc commented Aug 16, 2023

> Very clever. Am I correct in understanding that in the ticket spinlock model, the locked field serves no functional purpose as the determination of it being locked/unlocked is solely dependent upon the owner and tail fields?

Hi @peter-mitsis ,

Indeed. Currently the "locked" field is something of a compromise to pass the tests: in tests/kernel/spinlock we access this field directly (which is a bad idea, imho) to verify the spinlock state.

P.S.: We probably shouldn't access the spinlock internals directly in tests/kernel/spinlock, and should introduce a k_spin_is_locked() API instead. But I don't have a firm opinion here, since such an API would only be used for the tests. So I just left it this way.

@amr-sc
Contributor Author

amr-sc commented Aug 17, 2023

Hi All,

Currently we have a test failure due to the following spinlock API extension in a module external to Zephyr (modules/audio/sof/zephyr/include/rtos/spinlock.h):

/* not implemented on Zephyr, but used within SOF */
static inline void k_spinlock_init(struct k_spinlock *lock)
{
#ifdef CONFIG_SMP
	atomic_set(&lock->locked, 0);
#endif
}

What's our policy about it? Should we bring this API to Zephyr, or should we extend it in SOF?
@andyross , @galak , @nashif , @peter-mitsis what do you guys think?

Contributor

@andyross andyross left a comment


First off, please check the correctness. This doesn't look right to me, but lockless algorithms are REALLY hard, so it's possible this will just need a round or seven of explanation.

Designwise: No complaints about having this in the stable of synchronization primitives, but absolutely not as the default for everyone. A spinlock is the minimum-overhead synchronization choice for routine mostly-uncontended critical sections. The kernel users don't want fairness, they want performance about 99% of the time.

Can you refactor as a k_fair_spinlock or whatnot?

atomic_t next_ticket;
atomic_val_t next_ticket_val;

atomic_set(&next_ticket, ticket_val);
Contributor


This looks really suspicious to me. You're getting the value of "owner" and then writing it unconditionally to "next_ticket" with no synchronization. The following atomic_inc() is clearly not atomic, because the write is unconditional.

That is: what happens if you have two threads in exactly this code racing against each other? You can draw paths through the execution where next_ticket is incremented either once or twice.

Contributor Author


Hi @andyross,

Here we just estimate what l->owner and l->tail of the free spinlock would look like for the particular l->owner. We don't really change the spinlock state until the CAS; we keep the estimated values on the stack.
The atomic operations before the atomic_cas are done only for the sake of value consistency. We can't just increment ticket_val, since that's not how we change the real l->owner and l->tail in k_spin_lock and k_spin_unlock:
the estimation should follow the same steps as the real algorithm, where we use atomic operations on these fields.

The contending threads indeed may have the same ticket_val and next_ticket_val, but only the fastest thread will acquire the spinlock with

if (!atomic_cas(&l->tail, ticket_val, next_ticket_val)) {
		goto busy;
	}

The second thread will not meet the CAS condition and will return -EBUSY.

Moreover, we are not entirely accurate here: if the spinlock is acquired AFTER we "freeze" ticket_val with atomic_get(&l->owner) and released BEFORE we reach the atomic_cas, we will still identify the spinlock as locked. So sometimes k_spin_trylock will fail with ticket spinlocks in a situation where the original spinlock would succeed. This is a side effect of having two atomic variables instead of one for ticket spinlocks.
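The estimate-then-CAS flow described in this exchange can be sketched as follows. This is an illustrative reconstruction using C11 atomics, not the actual Zephyr patch; the `ticket_lock`/`ticket_trylock` names are invented for the example. We read `owner`, compute on the stack what `tail` would be if the lock were free (i.e. `tail == owner`), and attempt to advance `tail` with a single CAS; if the lock is held, or another contender wins the CAS first, we report busy.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical ticket lock layout for illustration. */
struct ticket_lock {
	atomic_uint owner; /* ticket currently being served */
	atomic_uint tail;  /* next ticket to hand out */
};

static int ticket_trylock(struct ticket_lock *l)
{
	/* "Freeze" the current owner; this is our guess at a free lock. */
	unsigned int ticket_val = atomic_load(&l->owner);
	unsigned int next_ticket_val = ticket_val + 1;

	/* The CAS succeeds only if tail == owner, meaning nobody holds
	 * the lock and nobody is waiting. Only the fastest of several
	 * racing contenders can win; the rest see a changed tail.
	 */
	if (!atomic_compare_exchange_strong(&l->tail, &ticket_val,
					    next_ticket_val)) {
		return -EBUSY;
	}
	return 0;
}
```

Note how two racing threads may compute identical `ticket_val`/`next_ticket_val` pairs, yet only one CAS can succeed, which is the point made above.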

@andyross
Contributor

FWIW: that SOF thing looks like a mistake. Spinlocks don't have an initialization API; they're intended to be zero-filled as initialization so they can be used out of .bss before initialization hooks can be called, etc... But if SOF really wants that, we can provide a K_SPINLOCK_INIT token or something that just expands to a zero-filled struct.

@amr-sc
Contributor Author

amr-sc commented Aug 21, 2023

> First off, please check the correctness. This doesn't look right to me, but lockless algorithms are REALLY hard, so it's possible this will just need a round or seven of explanation.
>
> Designwise: No complaints about having this in the stable of synchronization primitives, but absolutely not as the default for everyone. A spinlock is the minimum-overhead synchronization choice for routine mostly-uncontended critical sections. The kernel users don't want fairness, they want performance about 99% of the time.
>
> Can you refactor as a k_fair_spinlock or whatnot?

Hi @andyross,

  1. Disable ticket spinlocks by default: sure, that's clear, let's disable them in the final version. I suggest, however, that we make the CI work so we ensure that all the tests pass with this feature enabled, and then switch it off (and check that everything is still fine) before the merge. What do you say?
  2. k_fair_spinlock - please clarify what you mean. Do you suggest having a different public API for these spinlocks? That would mean we wouldn't be able to seamlessly switch the kernel between "fairness" and "footprint/performance". And the main motivation behind this change was that I experienced a live-lock in the kernel on a 4-core SMP platform due to the original spinlocks' unfairness.

P.S.: this concept is not my invention; Linux went down this path a long time ago: https://lwn.net/Articles/267968/

@amr-sc
Contributor Author

amr-sc commented Aug 21, 2023

> FWIW: that SOF thing looks like a mistake. Spinlocks don't have an initialization API, they're intended to be zero-filled as initialization so they can be used out of .bss before initialization hooks can be called etc... But if SOF really wants that we can provide a K_SPINLOCK_INIT token or something that just expands to a zero-filled struct.

Thanks! Then let's make a corresponding change to SOF. I will post a link here once I open a review there.

@andyross
Contributor

Yeah, for sure let's merge this as a separate API while it stabilizes. Adding a bunch of macros to do the switch at some point in the future is trivial, but even then I think that most apps that hit these concerns probably do want control over which locks get the treatment (because, even in Linux, the overwhelming majority of locks aren't contended).

Can you submit a bug on that live lock situation you hit? I'm betting it's going to root cause to something closer to a scheduler mistake, or be something that a tweaked arch_spin_relax() can treat more simply. And even if that's not possible, I think I'd be a lot more OK with a patch "replace the scheduler lock with a ticket spinlock" (especially if it came with a bug report and measurements!) than "all spinlocks are now ticket locks".

Basically, we don't run in contention environments anything like what Linux deals with. Zephyr platforms tend to care a lot more about code size and simplicity[1] than about this kind of optimization, and we don't have the depth of community expertise to draw on to maintain this stuff. The barrier to "let's put a novel lockless algorithm[2] right into the middle of the most important primitive in the kernel" is a ton higher for us than for Linux.

I don't have time to puzzle through the explanation about the trylock code, will get to that soon. (Though my first thought on scanning your text is "Oh, it's a guess at what we'll get" -- so maybe call the variable in question something like "guess"?).

[1] e.g. think of all the CI overhead of testing both of these spinlock variants to the same level of coverage!

[2] Note that we just got support for standard memory barrier primitives about two months ago, and I'm willing to bet that the atomics layer is still going to have warts on some platforms (arm64 is the poster child for surprising out-of-order semantics, check there for sure) that would break this. I realize @carlocaione isn't on the review list and should be; barriers are his thing.

@amr-sc
Contributor Author

amr-sc commented Aug 29, 2023

Hi @andyross , @carlocaione ,

What has been done so far:

@andyross , I have some doubts when I think about the separate API for this implementation. Here is what troubles me:
1.) we would need a distinct struct k_fair_spinlock, which would be mostly a duplicate of generic struct k_spinlock
2.) we would need to provide the same functionality as the generic spinlock has, including the features like CONFIG_SPIN_VALIDATE, etc. which would essentially mean a lot of duplicated code to keep in-sync
3.) the distinct API wouldn't fix the issue mentioned in #62043 , since the existing spinlocks like sched_spinlock in Zephyr kernel would not be affected

Could we have a configuration option and keep the ticket spinlocks disabled by default? This seems safe, unless I'm missing something. What do you say?

P.S.: the latest build failure seems not to be related to my changes (error in some device tree?)

devicetree error: 'config' is marked as deprecated in 'properties:' in /__w/zephyr/zephyr/dts/bindings/sensor/ti,ina230.yaml for node /soc/i2c@40005400/ina230@40.

@andyross
Contributor

> need to fix the new test, it fails almost 99% of the time when run locally. not sure how it passed CI...

It may depend on environment. On my box, when unloaded, it's 90% reliable or so. Most likely the problem is that host scheduling is failing to run the two threads simultaneously, and obviously a contention test only works if both CPUs are actually running. Having one halt magically because of the host scheduler is going to look very unfair. This may be hard to tune (e.g.: wait for some kind of "the other side is running" detection, then run for just a short time ("significantly less than CONFIG_HZ on the host kernel"), then repeat. Something like that.

> @andyross does not want to have the code disabled by default in such a key area. Instead, he suggests having a canary platform that acts as a test introductory platform first

Right. More bluntly: none of my complaints rise to a level where I'd refuse it based on the code alone[1]. Prove to me that this runs on all our SMP platforms and doesn't break anything and I'll shut up and +1.

[1] The code itself looks fine, though I do think it wants some attention if we're going to be serious about this long term: rollover protection for sure, probably also 16 bit counters a-la Linux. (Or 15 and roll the locked bit into the same word), etc...
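The Linux-style layout mentioned in that footnote can be sketched like this. This is an illustrative example only, assuming a hypothetical `ticket_pair` layout; it is not a concrete proposal for Zephyr's struct. Packing both counters as 16-bit halves of one 32-bit word keeps the footprint small, and unsigned 16-bit arithmetic wraps modulo 2^16, so the locked/unlocked determination stays correct across rollover as long as fewer than 65536 CPUs ever wait at once.

```c
#include <stdint.h>

/* Hypothetical packed ticket counters, a la Linux (illustrative only). */
union ticket_pair {
	uint32_t raw; /* both counters in one word */
	struct {
		uint16_t owner; /* ticket currently being served */
		uint16_t tail;  /* next ticket to hand out */
	};
};

/* The lock is held whenever a ticket has been handed out but not yet
 * served. uint16_t wraparound keeps this test valid across rollover.
 */
static int ticket_is_locked(union ticket_pair t)
{
	return t.owner != t.tail;
}
```

A single-word layout also lets lock/unlock be implemented with one atomic operation on `raw`, which is part of why Linux adopted it.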

@zephyrbot zephyrbot added the area: RISCV RISCV Architecture (32-bit & 64-bit) label Oct 31, 2023
Basic spinlock implementation is based on single
atomic variable and doesn't guarantee locking fairness
across multiple CPUs. It's even possible that single CPU
will win the contention every time which will result
in a live-lock.

Ticket spinlocks provide a FIFO order of lock acquisition
which resolves such unfairness issue at the cost of slightly
increased memory footprint.

Signed-off-by: Alexander Razinkov <[email protected]>
Added test to verify the spinlock acquisition fairness
in relation to the CPUs contending for the spinlock.
This test is only enabled for Ticket Spinlocks, which
are required to provide this kind of fairness.

Signed-off-by: Alexander Razinkov <[email protected]>
Added test to verify that the value of atomic_t will be the same
in case of overflow whether incremented in an atomic or non-atomic manner

Signed-off-by: Alexander Razinkov <[email protected]>
@amr-sc
Contributor Author

amr-sc commented Oct 31, 2023

Hi @andyross , @cfriedt , @nashif , @evgeniy-paltsev , @npitre , @carlescufi , @carlocaione , as was discussed on our Architecture WG meeting I've enabled CONFIG_TICKET_SPINLOCKS by default for qemu_riscv64_smp platform.
Is this enough or should I enable it for some other SMP platforms as well?

I've also lengthened a busy loop imitating some work within the test, to help thread synchronization. This is critical for QEMU, where a contending thread can be scheduled out at any moment, which results in artificial unfairness. This is not a 100% robust way to avoid the artificial unfairness, but let's see how it goes.

P.S.: QEMU has a more advanced and precise emulation mode, CONFIG_QEMU_ICOUNT, but it is disabled for qemu_riscv64_smp, and when I try to enable it the platform build fails.

@andyross
Contributor

My memory from the last time we went through this was that icount provides deterministic emulation for only one core; in SMP the two CPUs still run in separate host threads, so it wouldn't address the problem.

@povergoing
Member

FYR, just putting some test reports here in case they might help. I tested it on a simulation platform, fvp_baser_aemv8r_smp (Arm64 SMP, 4 cores), which is a 100% deterministic platform where all cores are simulated in only one host thread (at least one host core).
cmd: west build -p always -b fvp_baser_aemv8r_smp -t run -T zephyr/tests/kernel/spinlock/kernel.multiprocessing.spinlock_fairness

With CONFIG_TICKET_SPINLOCKS=y:

===================================================================
START - test_spinlock_fairness
CPU0 acquired spinlock 1000 times, expected 1000
CPU1 acquired spinlock 1000 times, expected 1000
CPU2 acquired spinlock 1000 times, expected 1000
CPU3 acquired spinlock 1000 times, expected 1000
 PASS - test_spinlock_fairness in 1.604 seconds
===================================================================

For comparison, CONFIG_TICKET_SPINLOCKS=n:

===================================================================
START - test_spinlock_fairness
CPU0 acquired spinlock 248 times, expected 1000
CPU1 acquired spinlock 153 times, expected 1000
CPU2 acquired spinlock 3599 times, expected 1000
CPU3 acquired spinlock 0 times, expected 1000

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/spinlock/src/spinlock_fairness.c:150: spinlock_test_spinlock_fairness: (spinlock_grabbed[core_id] < FAIRNESS_TEST_CYCLES_PER_CORE is true)
CPU0 starved on a spinlock: acquired 248 times, expected 1000

 FAIL - test_spinlock_fairness in 1.610 seconds
===================================================================

However, it's worth noting that FVP simulates all cores on a single host core, which means the host core might simulate e.g. ~1000 instructions per core sequentially (for example, with core 0 ~ core 3, the host core simulates 1000 instructions for core0 first, then for core1, and so on). That's why it's so "unfair" when CONFIG_TICKET_SPINLOCKS=n, as the report shows. I think this exaggerated result will not happen on real hardware, because real hardware executes instructions simultaneously.

@npitre
Collaborator

npitre commented Nov 1, 2023 via email

Contributor

@andyross andyross left a comment


Spent some time today with twister and the six[1] SMP emulation platforms. The upshot is that the spinlock test is much improved, but still failing significantly more than other tests[2] in a loaded CI run. Roughly half the failures in any run are this test:

  • A full twister suite with default kconfig saw 10 failures
  • Two full suite runs with tickets set to =y globally saw 19 the first time and 10 the second

Breaking out the spinlock test specifically and running it on an unloaded system, though, looks much better: one failure in 60 twister iterations (across 15 test binaries each, so 1 in 900). Given that the test is known to be able to fail stochastically in emulation, that seems actually OK to me.

Basically it looks good to me from an emulation perspective. I'm going to put a +1 on it, as it seems extremely unlikely to break anything at the integration level.

I would still love to see someone:

  1. Get some results from a full suite on a real SMP hardware platform
  2. Report some benchmarking numbers (not just the fairness results) on real hardware

[1] twister -p qemu_cortex_a53_smp -p qemu_riscv32_smp -p qemu_riscv64_smp -p qemu_x86_64 -p qemu_x86_64_nokpti, which as of this PR runs 1874 individual test binaries.

[2] Though an interesting second place I hadn't noticed was tests/subsys/zbus/integration. This dies once in almost every run, and twice died successively and required a third --only-failed run. Probably worth checking out, usually this means a test with needlessly tight timing constraints.

@povergoing
Member

> I would still love to see someone:
> Get some results from a full suite on a real SMP hardware platform
> Report some benchmarking numbers (not just the fairness results) on real hardware

IIRC @cocoeoli @lenghonglin are doing some work related to real hardware SMP (Raspberry Pi 4 and ROC-RK3568-PC?). Not sure if you've had a chance to test?

@lenghonglin
Contributor

> I would still love to see someone:
> Get some results from a full suite on a real SMP hardware platform
> Report some benchmarking numbers (not just the fairness results) on real hardware
>
> IIRC @cocoeoli @lenghonglin are doing some work related to real hardware SMP (Raspberry Pi 4 and ROC-RK3568-PC?), no idea if you had a chance to have a test?

Busy these days. I will try to complete SMP on Raspberry Pi 4 B this weekend.

@povergoing
Member

> Busy these days. I will try to complete SMP on Raspberry Pi 4 B this weekend.

Appreciate it, and really no rush, just take your time

@amr-sc amr-sc requested a review from cfriedt November 2, 2023 07:34
@evgeniy-paltsev
Collaborator

> Get some results from a full suite on a real SMP hardware platform

I'll run (well, actually re-run) the test suite on ARC HW SMP boards (HSDK and HSDK4xD) - both have 4 CPU cores.

@evgeniy-paltsev
Collaborator

I've played a bit with the fairness test on real SMP HW.

In the worst case I got these results (with the regular Zephyr spinlock implementation):

HSDK (4 cores)

START - test_spinlock_fairness
CPU0 acquired spinlock 0 times, expected 1000
CPU1 acquired spinlock 3997 times, expected 1000
CPU2 acquired spinlock 1 times, expected 1000
CPU3 acquired spinlock 2 times, expected 1000

HSDK (2 cores)

START - test_spinlock_fairness
CPU0 acquired spinlock 1998 times, expected 1000
CPU1 acquired spinlock 2 times, expected 1000

HSDK4xD (4 cores)

START - test_spinlock_fairness
CPU0 acquired spinlock 1999 times, expected 1000
CPU1 acquired spinlock 0 times, expected 1000
CPU2 acquired spinlock 1996 times, expected 1000
CPU3 acquired spinlock 5 times, expected 1000

That's expected: if a core already has the atomic variable in its local cache (and in owned state), then this core will perform the atomic operation faster than the other cores.

The test passes on all SMP HW when I enable ticket spinlocks.

I've also run the full test suite on all ARC SMP platforms with ticket spinlocks enabled - and I don't see any additional failures.

@nashif nashif merged commit 5725585 into zephyrproject-rtos:main Nov 4, 2023
@amr-sc amr-sc deleted the ticket-spinlocks branch November 7, 2023 06:43
Comment on lines +1201 to +1208
Basic spinlock implementation is based on single
atomic variable and doesn't guarantee locking fairness
across multiple CPUs. It's even possible that single CPU
will win the contention every time which will result
in a live-lock.
Ticket spinlocks provide a FIFO order of lock acquisition
which resolves such unfairness issue at the cost of slightly
increased memory footprint.
Collaborator


In a future update it would be great to reword so that the help starts with what a ticket spinlock is, rather than what it is not. I.e. the last sentence (with minor tweaks, probably) should be first.
