Experiencing some random hangs under heavy workload #47
Comments
Also all the tunable configs are the default. |
Hi there @raykzhao, thank you for your suggestion, I'll try it. |
Sadly it didn't work. I tried: but the hangs are still there. |
Could you please also set Is RDB enabled? |
The hangs still happen with kernel.sched_starve_factor=0
As I think it's enabled by default with the patch, I believe so. |
Could you please try without RDB? |
As I don't think there is a runtime way to disable it, it's necessary to recompile, right? |
Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too? Also provide all technical information and versions like kernel, wine, which game, and what settings. Thanks |
Sure this one here was from my last compile on 5.13.8 config.txt.
And when you say settings, which ones do you mean? The cacule ones? If so, they are all at the defaults. Now let me recompile; it will take some time. |
I have such lags in rdr2 (only), and setting kernel.sched_interactivity_factor=50 seems to help. It doesn't happen without RDB, but without RDB background load has stronger negative effects. I will test kernel.sched_starve_factor=0, too. |
Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri. Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better? |
Hi @ltsdw, good to hear it's working fine now. However, I would really like to troubleshoot why RDB causes these freezes. Regarding tuning, there is no specific way to test. I tried to make the defaults work well in general, but when you have an issue you can change them. You need some background in CPU scheduling so that you can read about each cacule sysctl and change them accordingly. I would like to keep this issue open until we see why RDB performs badly with wine. Thank you |
I suspect it is related to rcu calls and soft irq. I will post some fixes to try soon. Thank you |
@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work. |
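For anyone who wants to compare context-switch rates between kernels, here is a minimal sketch. It assumes the figure of interest is the system-wide rate per second, which the comment above does not state. It reads the cumulative `ctxt` counter from `/proc/stat`. The function names are made up for illustration.

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(before, after, seconds):
    """Convert two cumulative counter readings into a per-second rate."""
    return (after - before) / seconds


def sample(path="/proc/stat", interval=1.0):
    """Read the counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, interval)
```

Calling `sample()` while the game is running, once on each kernel, gives numbers that can be compared directly. The counter covers the whole system, not a single process.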
Hi @JohnyPeaN , @ltsdw To narrow down the troubleshooting, could you please try RDB with: Also, can you try with:
Or try vice versa: in case you have most RCU configs disabled, try enabling them. Based on the RDB code review I did two minutes ago, I suspect the cause is nohz balancing. I am assuming that you are using no_hz_full? Please let me know if any of the above changes fix the freezes so I can propose a fix based on your feedback. If none of the above configs has any positive effect, then I can investigate something else. Thank you |
@hamadmarri I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later. |
Hi @JohnyPeaN I would advise you first try with Thank you |
ok, I'll try too, but I'll need some time, thank you! |
while compiling I noticed this:
and the building failed. |
Oh, now I see it.
but now I'm confused, should I or not use PREEMPT=n? Apparently it can't be compiled without that! |
Hi @ltsdw Please try first with Thank you |
But now there is a compile error happening |
I think the compiling error is because the
|
Ok, I tested with
Just a question, should I still use the |
@hamadmarri |
Just tested with:
and that also didn't work; the hangs are still happening for me. |
Hi @ltsdw Another thing I would suspect is the autogroup. Have you tried to disable the autogroup? You may try to add |
Even when using games with futex2 I don't face any issues. I'm going to test again, but I'm sure there is no problem. |
Hi @raykzhao I am not sure, actually, because most of the feedback is not strongly related. None of the fixes worked for @ltsdw, but for @JohnyPeaN changing to periodic hz worked. Also, I have tested my proposed fix and it reduces performance to worse than the CFS balancer. I guess the best way is to make RDB work with periodic hz and without {auto, fair}_group. The locking issues in games could be a reason, but why is it OK with the CFS balancer and bad with RDB? I thought it was because the CFS balancer goes through If you don't mind @ALL could you please attach the cpu topology with Thank you |
I don't know if it was to put an image here or something else, but here: Thank you for your support! |
Hi @hamadmarri, |
@hamadmarri |
Hey @hamadmarri This is the machine I'm testing: |
@hamadmarri cacule = no lags So it seems that the compositor gets neglected under certain circumstances and although game renders its images, they are not shown. Here is top of perf session:
|
Hi @JohnyPeaN I think it is related to the tick update, where RDB-r3 needs to update the highest IS task on every tick. However, the previous RDB version used a slightly different approach since enqueue was sorted. Do the lags happen on the previous RDB version (the one without sched_group support)? Thank you for the observation 👍 |
Hi @MoisesMH Just to double check, could you please try with What I am thinking is that you and @JohnyPeaN have many CPUs, so there is a high probability that some of them go idle and the no_hz wakeup didn't work with RDB. Also I am afraid that @ltsdw needs to retry with Another suspicion is that the RDB-r3 balance tries to pick from all tasks in I am 100% sure that RDB is not considering the nohz kick to wake up idle CPUs, and if setting periodic tick works for all of you, then we know that it is about the nohz wakeup kicker. However, if @ltsdw still has freezes while using periodic tick, then we might have another issue as well. Please make sure that:
During testing to make sure that the freezes are not related to the cache or starve scores. Thank you |
@ltsdw mentioned he used RDB without autogroup and it gave him no spikes. At the moment, I've compiled the kernel I've used without the patch and these parameters: RDB Interval: 19 (default). Also I've tweaked some options for the kernel configuration. I'll post it here just in case you want to take a look: https://drive.google.com/file/d/1eR6NIPe88lc1SCz_nqGjNXPPRPOSavv0/view?usp=sharing For me it's weird, because 15 minutes ago I was testing about 30 minutes of gameplay in Star Wars Battlefront II. I was using the latest Mangohud version from AUR (not the mangohud-git one). For approximately the first 15 minutes I experienced no spikes at all and the framerate was constant and smooth, but after that I encountered some small ones roughly every 5 minutes, which lasted 2 seconds each. Then it seemed the spikes were gone, until my game froze for 5 secs, just like when I had autogroup enabled. After the freeze, audio and video were out of sync for a second and then it returned to normal. So it's more related to heavy workload, as the title of this issue suggests. My CPU usage was about 54 to 59% during gameplay and GPU at 99%, which is expected because of the graphics card rendering the shaders and everything else. I was using RDB-r2 I guess, because it's included in the linux-tkg kernel provided by @TkGlitch. I put the links below: Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"): So that is the cacule-rdb version I'm using. Should I test RDB-r3, or is RDB-r2 fine? I'm not sure exactly which version that commit belongs to, but I can test compiling it manually. I don't know if the AUR version linux-cacule-rdb presents these problems too, but I'll first try the one included in the linux-tkg kernel. I prefer it, because it has more patches which can increase performance and improve CPU efficiency. However, I'm starting to think one of those patches could be causing the problem too.
On the other hand, those theories you mention can be possible. I haven't tested without the compositor. I don't know how I could deactivate it. I'll search for that and test without it too. Currently I'm using OpenGL 2.0. There's also OpenGL 3.1 available. I've read many people suggested Compton as a replacement. I could test it too. That's my progress till now. I'll keep testing and I'll notify if CONFIG_HZ_PERIODIC=y and the parameters kernel.sched_cache_factor = 0 and kernel.sched_starve_factor = 0 make any difference. Thanks for the reply! |
hi @hamadmarri Just recompiled here, with
but no difference, I'm still experiencing the hangs. The compositor was also mentioned here; I don't know if disabling the compositor worked for you @JohnyPeaN, but I tried disabling the compositor here and it didn't make any difference (but I'm using picom, not plasma). So far what worked was disabling RDB altogether or using |
@hamadmarri I'm not sure which process is responsible for compositing, but I think it's kwin_x11. Anyway, it has normal priority (0), like the rest of the desktop. I will try changing its priority to see if it has an effect. Earlier RDB versions had these problems for me. Maybe they changed a little. Earlier RDB couldn't utilize all cores during compilation with #threads=#cores. This seems to be better now. The lags in game were similar back then. I'm also testing whether foreground processes are affected by heavy background processes, like the mentioned compilation with Also, I'm not recompiling to test autogroup on/off. Just to confirm does |
@JohnyPeaN |
hey @hamadmarri CC kernel/sched/clock.o On the other hand, when I compile with just tickless idle (CONFIG_NO_HZ_IDLE=y), the kernel compiles without any errors. I've only applied cacule, uksm, futex2, security, and more uarches patches. It also gave me errors because I've also applied fsync, which is a previous version of the more advanced futex2 approach, but they're other functionalities. That explains why, with the PKGBUILD provided by TkGlitch, gave me those errors too when CONFIG_HZ_PERIODIC=y is applied. I don't know why. I think you should inspect those lines. I don't think the other patches are causing the problem, since there's no other scheduler I've integrated, and CacULE replaces CFS. That's all the information I can provide. Greetings! NOTE: maybe the aim of your scheduler is only programmed to work exclusively with full tickless and just tickless idle kernels? Maybe I'm confused hehe |
Hi @MoisesMH Could you please try this fix #47 (comment) I will update the fix in the github soon. Thanks EDIT: |
hey @hamadmarri Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"): I've got to say, in my system, even with CONFIG_HZ_PERIODIC=y, it's still having lag spikes, but they're less frequent than with CONFIG_NO_HZ_IDLE=y. I also used the variables you suggested in my /etc/sysctl.conf kernel.sched_cache_factor = 0 and then executed "sudo sysctl --system" to apply the changes to kernel in the document but, still, those hangs are present. Disabling autogroup (kernel.sched_autogroup_enabled=0) helped a little to reduce the frequency of those lag spikes and its duration (lasted up to 2 secs each hang when it happens). Before, when CONFIG_NO_HZ_IDLE=y, they lasted 5 secs at average. In the menu, everything is smooth, even on gameplay, when the hangs are not present, the game runs butter-smoothly. Oh, another detail is that, while a hang is present, the Mangohud overlay reveals the CPU usage soared 10% more on average (from 55% to 65%, even it reached 74%). It's weird that It just happens on heavy workload. For other tasks, like running Audacious or Lutris, it's noticeably faster than without RDB. It surprises me the celerity at opening different applications. For those jobs it's butter-smooth, but just happens when at intensive gameplay. That's all I got. I really don't have an idea why it just happens at intensive workload. Maybe the code is not adapted to deal with it and just with ordinary tasks. I'll remain here for more news. Thanks for the effort. Keep it up! EDIT: could you apply the two last commits to the cacule-5.14.patch please? I want to apply it for testing with the futex2-dev kernel from Collabora. Thanks! |
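Since several comments depend on whether the sysctl values were really applied after `sudo sysctl --system`, here is a minimal sketch for checking them. It assumes the three settings named in this thread and the usual mapping of sysctl names to files under `/proc/sys`, with dots replaced by slashes. The `WANTED` table and function names are illustrative only.

```python
from pathlib import Path

# Values discussed in this thread; adjust to whatever you are testing.
WANTED = {
    "kernel.sched_cache_factor": "0",
    "kernel.sched_starve_factor": "0",
    "kernel.sched_autogroup_enabled": "0",
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")


def check(wanted, root="/proc/sys"):
    """Return {name: (expected, actual)} for every setting that differs.

    `actual` is None when the file does not exist, e.g. when the
    running kernel was built without the corresponding feature.
    """
    mismatches = {}
    for name, expected in wanted.items():
        path = sysctl_path(name, root)
        actual = path.read_text().strip() if path.exists() else None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

An empty result from `check(WANTED)` means the running kernel has all three values set as intended. A `None` entry usually means that sysctl does not exist on the booted kernel.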
Hi @MoisesMH Thank you |
Nice! Thanks to you. Lately I've tried the liquorix kernel with the MuQSS scheduler (CONFIG_HZ_100=y is default), android modules, ntfs3 and uksm. What surprises me is the CPU usage. At the game menu of Star Wars Battlefront II, your scheduler with linux-tkg (CONFIG_HZ_1000=y is default) uses 24% to 32-33% CPU, but with this new kernel it was reaching a whopping 54% to 59%. I don't believe uksm is what increases the CPU usage, because its main function is memory deduplication. Also, during gameplay, your scheduler was at around 54% to 62% CPU usage, while the lqx kernel with MuQSS reached from 66% up to 79%. It's impressive how optimized the linux-tkg kernel is compared to liquorix. I haven't tried linux-tkg with uksm yet. I'm going to compile it now and see how it does with CacULE, with and without RDB, for testing. Keep it up! |
It could be
Thank you |
Hey @hamadmarri |
Hi @MoisesMH , @raykzhao , @JohnyPeaN , @ltsdw , @ptr1337 , @SoongVilda I am planning to rework RDB and start it over from the beginning. I need to review how the nohz idle wakeup mechanism works first. I am also thinking of adding an extra feature where some CPUs are assigned to serve interactive tasks (they give more priority to interactive tasks but can still run non-interactive tasks at the same time). This idea is based on this (https://www.researchgate.net/profile/Julien-Soula/publication/254213707_ARTiS_an_Asymmetric_Real-Time_Scheduler_for_Linux_on_Multi-Processor_Architectures/links/00b495350104a70d19000000/ARTiS-an-Asymmetric-Real-Time-Scheduler-for-Linux-on-Multi-Processor-Architectures.pdf) The next RDB must consider all nohz work, and maybe a global queue of candidate tasks holding one task from each CPU (the task that has the highest IS but is not running). Each CPU will have one slot in the global queue, and it must guarantee that the task advertised in the global queue is ready to migrate at any time, unless the slot has a null value. The amount of locking could increase, but the queue is not very big; it only contains Thanks |
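To make the global-queue idea easier to picture, here is a toy user-space model in Python. It is not kernel code and not the planned RDB implementation. It assumes a single lock around the slots and a hypothetical numeric `score` where a larger value means the task should be picked first. How IS actually orders tasks is not modeled here.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    name: str
    score: int  # hypothetical stand-in for IS: larger is picked first


class GlobalCandidateQueue:
    """One slot per CPU; a slot holds that CPU's best waiting task or None."""

    def __init__(self, nr_cpus: int) -> None:
        self._slots: List[Optional[Task]] = [None] * nr_cpus
        self._lock = threading.Lock()  # assumption: one lock for all slots

    def advertise(self, cpu: int, task: Optional[Task]) -> None:
        """Publish (or clear, with None) the candidate task for `cpu`."""
        with self._lock:
            self._slots[cpu] = task

    def steal_best(self, thief_cpu: int) -> Optional[Tuple[int, Task]]:
        """Remove and return (source_cpu, task) with the highest score,
        skipping the caller's own slot and empty slots."""
        with self._lock:
            best_cpu = None
            for cpu, task in enumerate(self._slots):
                if cpu == thief_cpu or task is None:
                    continue
                if best_cpu is None or task.score > self._slots[best_cpu].score:
                    best_cpu = cpu
            if best_cpu is None:
                return None
            task = self._slots[best_cpu]
            self._slots[best_cpu] = None  # slot is empty until re-advertised
            return best_cpu, task
```

Because there is only one slot per CPU, a scan costs at most one step per CPU, which fits the remark that the queue stays small. The single lock is the simplest choice for a sketch. A real design would have to decide how much contention that lock can tolerate.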
Hey @hamadmarri EDIT: my current RDB value is 15 and running with CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ=y. Also, I've disabled the compositor with Alt+Shift+F12 keys combination. EDIT 2: These are the combinations which gave me almost no spikes on Star Wars Battlefront II:
Increasing the cache factor only made things worse and didn't allow a decent gameplay experience; in my opinion, too high a value can be the cause of those lag spikes. My numbers were inspired by my total installed memory, as shown by the "free" command on the console. It returned a total of 15944 MB (16GB). Numbers I've got: 7972 1993 17937 19930 3986 5979 4983. |
Hi @MoisesMH The cache factor does not seem to work well with the RDB design. I need to troubleshoot the cache and starve factors too. Thank you |
Yeah, it seems to be causing the issue. I've discovered another combination, which is close to the default I think: kernel.sched_cache_factor = 10629 I've experienced no spikes at all with this configuration; at the beginning of gameplay there was one peak, but I haven't noticed any freezes or big hangs. I think I'll stay with this configuration. Together they sum to less than the sched_interactivity_factor, and one is double the other (1/3 * 31888, 2/3 * 31888). Hope you're doing great with your investigation and development! |
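The arithmetic behind the "1/3 and 2/3 of 31888" remark can be checked with a short sketch. The base value 31888 and the 10629 result come from the comment above. The second number is only computed here and was not quoted as a setting in this thread.

```python
def split_thirds(base):
    """Return (one third, two thirds) of `base`, rounded to integers."""
    return round(base / 3), round(2 * base / 3)


one_third, two_thirds = split_thirds(31888)
print(one_third, two_thirds, one_third + two_thirds)
# prints: 10629 21259 31888
```

After rounding, the larger value is one more than twice the smaller one (21259 versus 21258), so "double" holds only approximately.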
I've been experiencing these hangs (where everything freezes for about 5 secs) when playing some games on wine that usually use a lot of CPU, and sometimes when watching videos.
To be sure it was the cacule patch and nothing else, I tested with the mainline arch kernel (no hangs). As I have some other patches applied to my kernel, I tried compiling it without the cacule patch (also no hangs). Then I applied cacule again and the hangs came back.
I'm not quite sure, but I think the commit that introduced it is 06cb3974.
I haven't tried reverting the commit to test; I only tested with these:
cacule-patch-with-hangs.txt - patch where hangs happens
cacule-without-hangs.txt - and without the hangs
But if needed I can try bisecting later to see exactly which commit causes it.