Switch to using the GC_finalize API for off-thread finalisation #22
Conversation
BDWGC is now a submodule in Alloy, so this step (which builds and runs Alloy's test suite) will never actually test a BDWGC PR's changes. Instead, it would just grab the current Alloy submodule commit.
We have now switched to using fast TLS support, so this old POSIX get/set_specific support for tracking GC roots in TLS is no longer required.
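For context, here is a generic before/after of the two TLS mechanisms being referred to; this is an illustrative sketch, not Alloy's actual root-tracking code.

/* Old style: POSIX thread-specific data, accessed through a key. */
#include <pthread.h>

static pthread_key_t roots_key;

static void roots_init(void) {
    pthread_key_create(&roots_key, NULL);   /* one-time key setup */
}

static void *get_roots_posix(void) {
    return pthread_getspecific(roots_key);  /* function call per access */
}

/* New style: compiler-supported "fast" TLS; each thread gets its own copy
 * of the variable and access compiles down to a few instructions. */
static __thread void *roots;

static void *get_roots_fast(void) {
    return roots;
}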
while (GC_should_invoke_finalizers() == 0) {
    pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
}
pthread_mutex_unlock(&flzr_mtx);
I think you might have a ping-pong situation here (warning: I might be wrong).
What I tend to do with condition variables (partly because I'm stupid and I find condvars hard) is associate a bool with the condvar. In essence, the bool says "if true, there is work to do". Then I unset it in the "listening" thread so I end up with something like:
bool work_to_do = false; // this is protected by flzr_mtx
while (1) {
    pthread_mutex_lock(&flzr_mtx);
    while (!work_to_do) {
        pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
    }
    work_to_do = false;
    pthread_mutex_unlock(&flzr_mtx);
    // do actual work
}
At the moment, IIUC, you aren't doing the equivalent of work_to_do = false with the mutex locked, so I think you might end up in a situation where the finalisation thread keeps getting spuriously woken up even when there's no actual work to do.
Now, amongst the many caveats in my thinking is that I don't know how GC_invoke_finalisers synchronises with the GC thread(s). Maybe this is all handled elsewhere in a nice way.
Er, I might be wrong about spurious wakeups: I think the problem is that you won't get woken up when you want.
Maybe this example in pizauth, which is very like what you've got here, might make sense of what I've said above. https://github.com/ltratt/pizauth/blob/master/src/server/notifier.rs#L33
In this case GC_should_invoke_finalizers() is a function that returns 0 or 1 depending on whether there's work to do; it is acting as the work_to_do bool here. If there are no finalisers to be run, it will always return 0.

GC_should_invoke_finalizers is essentially a wrapper around an atomic global, so calling it is thread-safe. It only gets set to 1 from inside a collection (where the finaliser threads are suspended anyway). It gets unset by the finaliser thread when there are no more finalisers to be run (again, this is done atomically). I think this is the root of your issue -- should it be unset while we hold the lock?
"GC_should_invoke_finalizers is essentially a wrapper around an atomic global"

Aha!

"should it be unset while we hold the lock"

Yes, I think so.
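To make the hazard behind this exchange concrete, here is a hypothetical reduction of the pattern being discussed, with an explicit atomic flag standing in for the state behind GC_should_invoke_finalizers (the flag, the notifier function, and their names are illustrative assumptions, not Alloy's or BDWGC's actual internals). The point is that if the flag is only cleared after the work is done, outside the mutex, a set that arrives mid-work can be clobbered and the thread then sleeps with work still queued; clearing it under the mutex before doing the work avoids that.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the atomic global behind
 * GC_should_invoke_finalizers; not the real BDWGC internals. */
static atomic_bool work_to_do;
static pthread_mutex_t flzr_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flzr_t_has_work = PTHREAD_COND_INITIALIZER;

/* Hypothetical collector-side hook: record that there is work and wake
 * the finaliser thread. */
static void notify_finaliser(void) {
    pthread_mutex_lock(&flzr_mtx);
    atomic_store(&work_to_do, true);
    pthread_cond_signal(&flzr_t_has_work);
    pthread_mutex_unlock(&flzr_mtx);
}

/* Finaliser-thread loop: consume the flag while holding the mutex, then
 * do the work. Clearing the flag only after the work would allow a
 * notify_finaliser() that ran mid-work to be wiped out, losing the wakeup. */
static void finaliser_loop(void) {
    for (;;) {
        pthread_mutex_lock(&flzr_mtx);
        while (!atomic_load(&work_to_do)) {
            pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
        }
        atomic_store(&work_to_do, false);   /* unset while we hold the lock */
        pthread_mutex_unlock(&flzr_mtx);
        /* ... run the ready finalisers ... */
    }
}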
Try this: d281df7
I've just tested this: 56cc901. Let me know if you're happy and I'll squash.
Can you quickly check if the performance of sum is still the same or not with these changes?
I can, but even with 10 iterations it will take about 30 mins to run :)
Just check with 1 or 2? It'll probably be fairly obvious if this has any chance of changing performance.
OK, offline, we think we see a difference.
Please squash.
This was replaced with the GC_finalize mechanism in e093be9.
Force-pushed from 56cc901 to 43d97f6.
Squashed!
74bb8e4
Implementation
Using an extension of the disclaim API proved difficult because of the need to re-mark objects pointed to by finalisable objects before a finaliser is run. This would have led to complex changes to the mark-phase for no real benefit.

Instead, by switching back to the GC_finalize mechanism but ensuring that finalisers are run off of mutator threads, we get this desired behaviour 'for free', and we can delete lots of code to do with keeping finalisable buffers around. With this approach, the list of objects which need to be finalised is instead accessed via a linked list threaded through the swept items in the heap. It is maintained by BDWGC without us having to make any additional changes.

This approach uses a simple condition variable to put the finaliser thread to sleep when there is no finalisation work to do. It is then potentially resumed at the end of a GC cycle if that cycle discovered finalisable objects.
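As a rough sketch of what that finaliser-thread loop looks like, the following uses BDWGC's public GC_should_invoke_finalizers / GC_invoke_finalizers API together with a condvar/mutex pair named after the ones in the diff above (flzr_t_has_work, flzr_mtx). The wake-up hook and the exact placement of the locking are assumptions; the real commits (d281df7, 56cc901) may arrange the synchronisation differently, in particular around where the queue-empty state is cleared. It also assumes finalisation-on-demand is enabled (e.g. via GC_set_finalize_on_demand(1)) so that mutator threads never run finalisers inline.

#include <pthread.h>
#include "gc.h"

static pthread_mutex_t flzr_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flzr_t_has_work = PTHREAD_COND_INITIALIZER;

/* Hypothetical hook, called at the end of a GC cycle that discovered
 * finalisable objects, to wake the sleeping finaliser thread. */
static void wake_finaliser_thread(void) {
    pthread_mutex_lock(&flzr_mtx);
    pthread_cond_signal(&flzr_t_has_work);
    pthread_mutex_unlock(&flzr_mtx);
}

/* Off-thread finaliser loop: sleep while BDWGC reports no ready
 * finalisers, then drain the ready queue. */
static void *finaliser_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&flzr_mtx);
        /* Non-zero only while the ready queue is non-empty, so this is
         * the 'work to do' check. */
        while (GC_should_invoke_finalizers() == 0) {
            pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
        }
        pthread_mutex_unlock(&flzr_mtx);
        /* Runs every finaliser currently on the ready queue; once it is
         * drained, GC_should_invoke_finalizers() returns 0 again. */
        GC_invoke_finalizers();
    }
    return NULL;
}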
Performance
Though the previous buffered finalisation approach was unsound, it is still worth seeing how switching to GC_finalize compares against it, because these new changes have a profound impact on finalisation speed. To measure this, I ran Alloy on som-rs with 3 configurations for 30 iterations:

Buffered: the previous buffered finalisation approach.
Spin: the GC_finalize approach where the finalisation thread simply spins waiting for work.
CondVar: the GC_finalize approach where the finalisation thread sleeps when there is no more work to do.

All configurations are run with finaliser elision disabled, so each benchmark ends up finalising huge numbers of objects -- in the 10s of millions. The results are shown below.
For Buffered finalisation, DeltaBlue was never able to run to completion due to a bug. CondVar dramatically improves wall-clock time on QuickSort, BubbleSort, and Loop, but the differences on most of the other benchmarks are statistically insignificant.
However, it does cause a big regression on Sum. I think this is because of heap growth: CondVar ends up finalising about 30% fewer objects than Spin on this particular benchmark. The direct consequence is that there is, on average, 30% more floating garbage in the heap which cannot be reclaimed, which means that each new GC cycle has to do more marking and sweeping (not to mention slower allocation due to free-list fragmentation).
I'm going to make an educated guess that this is because of the additional synchronisation costs of the condition variable (an extra mutex lock/unlock, plus an atomic load for the fin_q_empty check). In cases like Sum, where each GC cycle pushes huge numbers of finalisers onto the queue, this extra synchronisation means the finaliser thread ends up falling further behind than if it were simply allowed to spin -- though I must admit, the fact that it results in 30% fewer objects being finalised here is surprising to me!
I think on balance we should keep the condition variable, as this makes it much easier for us to spin up multiple finaliser threads in the future if we want to parallelise some of this work.