Switch to using the GC_finalize API for off-thread finalisation #22
Conversation
BDWGC is now a submodule in Alloy, so this step (which builds and runs Alloy's test suite) will never actually test a BDWGC PR's changes. Instead, it would just grab the current Alloy submodule commit.
We have now switched to using fast TLS support, so this old POSIX get/set_specific support for tracking GC roots in TLS is no longer required.
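For context, here is a generic before/after of the two TLS mechanisms being referred to; this is an illustrative sketch, not Alloy's actual root-tracking code.

/* Old style: POSIX thread-specific data, accessed through a key. */
#include <pthread.h>

static pthread_key_t roots_key;

static void roots_init(void) {
    pthread_key_create(&roots_key, NULL);   /* one-time key setup */
}

static void *get_roots_posix(void) {
    return pthread_getspecific(roots_key);  /* function call per access */
}

/* New style: compiler-supported "fast" TLS; each thread gets its own copy
 * of the variable and access compiles down to a few instructions. */
static __thread void *roots;

static void *get_roots_fast(void) {
    return roots;
}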
while (GC_should_invoke_finalizers() == 0) {
    pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
}
pthread_mutex_unlock(&flzr_mtx);
I think you might have a ping-pong situation here (warning: I might be wrong).
What I tend to do with condition variables (partly because I'm stupid and I find condvars hard) is associate a bool with the condvar. In essence, the bool says "if true, there is work to do". Then I unset it in the "listening" thread so I end up with something like:
bool work_to_do = false; // this is protected by flzr_mtx
while (1) {
    pthread_mutex_lock(&flzr_mtx);
    while (!work_to_do) {
        pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
    }
    work_to_do = false;
    pthread_mutex_unlock(&flzr_mtx);
    // do actual work
}
At the moment, IIUC, you aren't doing the equivalent of work_to_do = false with the mutex locked, so I think you might end up in a situation where the finalisation thread keeps getting spuriously woken up even when there's no actual work to do.
Now, amongst the many caveats in my thinking is that I don't know how GC_invoke_finalisers synchronises with the GC thread(s). Maybe this is all handled elsewhere in a nice way.
Er, I might be wrong about spurious wakeups: I think the problem is that you won't get woken up when you want.
Maybe this example in pizauth, which is very like what you've got here, might make sense of what I've said above. https://github.com/ltratt/pizauth/blob/master/src/server/notifier.rs#L33
In this case GC_should_invoke_finalizers() is a function that returns 0 or 1 depending on whether there's work to do; it is acting as the work_to_do bool here. If there are no finalisers to be run, it will always return 0.

GC_should_invoke_finalizers is essentially a wrapper around an atomic global, so calling it is thread-safe. It only gets set to 1 from inside a collection (where the finaliser threads are suspended anyway). It gets unset by the finaliser thread when there are no more finalisers to be run (again, this is done atomically). I think this is the root of your issue -- should it be unset while we hold the lock?
"GC_should_invoke_finalizers is essentially a wrapper around an atomic global"

Aha!

"should it be unset while we hold the lock"

Yes, I think so.
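To make the hazard behind this exchange concrete, here is a hypothetical reduction of the pattern being discussed, with an explicit atomic flag standing in for the state behind GC_should_invoke_finalizers (the flag, the notifier function, and their names are illustrative assumptions, not Alloy's or BDWGC's actual internals). The point is that if the flag is only cleared after the work is done, outside the mutex, a set that arrives mid-work can be clobbered and the thread then sleeps with work still queued; clearing it under the mutex before doing the work avoids that.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the atomic global behind
 * GC_should_invoke_finalizers; not the real BDWGC internals. */
static atomic_bool work_to_do;
static pthread_mutex_t flzr_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flzr_t_has_work = PTHREAD_COND_INITIALIZER;

/* Hypothetical collector-side hook: record that there is work and wake
 * the finaliser thread. */
static void notify_finaliser(void) {
    pthread_mutex_lock(&flzr_mtx);
    atomic_store(&work_to_do, true);
    pthread_cond_signal(&flzr_t_has_work);
    pthread_mutex_unlock(&flzr_mtx);
}

/* Finaliser-thread loop: consume the flag while holding the mutex, then
 * do the work. Clearing the flag only after the work would allow a
 * notify_finaliser() that ran mid-work to be wiped out, losing the wakeup. */
static void finaliser_loop(void) {
    for (;;) {
        pthread_mutex_lock(&flzr_mtx);
        while (!atomic_load(&work_to_do)) {
            pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
        }
        atomic_store(&work_to_do, false);   /* unset while we hold the lock */
        pthread_mutex_unlock(&flzr_mtx);
        /* ... run the ready finalisers ... */
    }
}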
Try this: d281df7
I've just tested this: 56cc901. Let me know if you're happy and I'll squash.
Can you quickly check if the performance of sum is still the same or not with these changes?
I can, but even with 10 iterations it will take about 30 mins to run :)
Just check with 1 or 2? It'll probably be fairly obvious if this has any chance of changing performance.
OK, offline, we think we see a difference.
Please squash.
This was replaced with the GC_finalize mechanism in e093be9.
Force-pushed from 56cc901 to 43d97f6.
Squashed!
74bb8e4
Implementation
Using an extension of the disclaim API proved difficult because of the need to re-mark objects pointed to by finalisable objects before a finaliser is run. This would have led to complex changes to the mark-phase for no real benefit.

Instead, by switching back to the GC_finalize mechanism but ensuring that finalisers are run off of mutator threads, we get this desired behaviour 'for free', and we can delete lots of code to do with keeping finalisable buffers around. With this approach, the list of objects which need to be finalised is instead accessed via a linked list threaded through the swept items in the heap. It is maintained by BDWGC without us having to make any additional changes.

This approach uses a simple condition variable to put the finaliser thread to sleep when there is no finalisation work to do. It is then potentially resumed at the end of a GC cycle if that cycle discovered finalisable objects.
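As a rough sketch of what that finaliser-thread loop looks like, the following uses BDWGC's public GC_should_invoke_finalizers / GC_invoke_finalizers API together with a condvar/mutex pair named after the ones in the diff above (flzr_t_has_work, flzr_mtx). The wake-up hook and the exact placement of the locking are assumptions; the real commits (d281df7, 56cc901) may arrange the synchronisation differently, in particular around where the queue-empty state is cleared. It also assumes finalisation-on-demand is enabled (e.g. via GC_set_finalize_on_demand(1)) so that mutator threads never run finalisers inline.

#include <pthread.h>
#include "gc.h"

static pthread_mutex_t flzr_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flzr_t_has_work = PTHREAD_COND_INITIALIZER;

/* Hypothetical hook, called at the end of a GC cycle that discovered
 * finalisable objects, to wake the sleeping finaliser thread. */
static void wake_finaliser_thread(void) {
    pthread_mutex_lock(&flzr_mtx);
    pthread_cond_signal(&flzr_t_has_work);
    pthread_mutex_unlock(&flzr_mtx);
}

/* Off-thread finaliser loop: sleep while BDWGC reports no ready
 * finalisers, then drain the ready queue. */
static void *finaliser_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&flzr_mtx);
        /* Non-zero only while the ready queue is non-empty, so this is
         * the 'work to do' check. */
        while (GC_should_invoke_finalizers() == 0) {
            pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
        }
        pthread_mutex_unlock(&flzr_mtx);
        /* Runs every finaliser currently on the ready queue; once it is
         * drained, GC_should_invoke_finalizers() returns 0 again. */
        GC_invoke_finalizers();
    }
    return NULL;
}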
Performance
Though the previous buffered finalisation approach was unsound, it is still worth seeing how switching to GC_finalize compares against it, because these new changes have a profound impact on finalisation speed. To measure this, I ran Alloy on som-rs with 3 configurations for 30 iterations:

Buffered: the previous buffered finalisation approach.
Spin: the GC_finalize approach where the finalisation thread simply spins waiting for work.
CondVar: the GC_finalize approach where the finalisation thread sleeps when there is no more work to do.

All configurations are run with finaliser elision disabled, so each benchmark ends up finalising huge numbers of objects -- in the 10s of millions. The results are shown below.
For Buffered finalisation, DeltaBlue was never able to run to completion due to a bug. CondVar dramatically improves wall-clock time on QuickSort, BubbleSort, and Loop, but the differences on most of the other benchmarks are statistically insignificant.
However, it does cause a big regression on Sum. I think this is because of heap growth: CondVar ends up finalising about 30% fewer objects than Spin on this particular benchmark. The direct consequence is that there is, on average, 30% more floating garbage in the heap which cannot be reclaimed, which means that each new GC cycle has to do more marking and sweeping (not to mention slower allocation due to free-list fragmentation).
I'm going to make an educated guess that this is because of the additional synchronisation costs of the condition variable (an extra mutex lock/unlock, plus an atomic load for the fin_q_empty check). In cases like Sum, where each GC cycle pushes huge numbers of finalisers onto the queue, this extra synchronisation means the finaliser thread ends up falling further behind than if it were simply allowed to spin -- though I must admit, the fact that it results in 30% fewer objects being finalised here is surprising to me!
I think on balance we should keep the condition variable, as this makes it much easier for us to spin up multiple finaliser threads in the future if we want to parallelise some of this work.