Increase parallel json presses from 32 to 64 #26336
Conversation
Are we sure parallelisation is the bottleneck? Looking at our JSON parse times, the amount of time for a single thread to parse a single front goes up significantly, which I feel implies that the cores on this machine are already working at their max. FYI these nodes only have 8 cores, which means at most 16 hardware threads. https://metrics.gutools.co.uk/d/jkMv3tS4k/dotcom-dashboard?orgId=1&from=1689247102364&to=1689250345570&viewPanel=15 (data from the incident)
Created a dashboard for this incident: https://metrics.gutools.co.uk/dashboard/snapshot/Fv5nt5BEOtENzRlL5ypErgHy4IzlQ6Bw. CPU utilisation was at 100% for the duration. If anything, I think we should probably LOWER how many pages can be processed at once by a single Facia instance; that would at least allow Facia to render some pages and fill the cache when an attack happens, instead of smashing the CPU and not being able to render any page at all.
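For what it's worth, one way to cap how many fronts a single instance presses at once is a simple permit-based throttle. This is only a sketch of the idea being suggested, not code from this repo; the limit of 8 and the `pressFront` name are assumptions:

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future, blocking}

object PressCap {
  // Assumed cap: small enough to leave CPU headroom for serving cached pages.
  private val maxConcurrentPresses = 8
  private val permits = new Semaphore(maxConcurrentPresses)

  // Wraps an arbitrary pressing operation so that at most
  // `maxConcurrentPresses` of them run at the same time on this instance.
  def pressFront[A](press: => A)(implicit ec: ExecutionContext): Future[A] =
    Future {
      blocking {
        permits.acquire()
        try press
        finally permits.release()
      }
    }
}
```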
This is really interesting - I'm kind of exhausted for today, but I don't suppose you have access to any JVM garbage collection statistics you could overlay on that dashboard?!
This is absolutely fascinating, and really helps unravel the problem - thank you for that, and for adding the graphs to your dedicated incident dashboard (though unfortunately the graphs show up very differently from the screenshots you posted for me - they appear to show an 'A-series' metric rather than the G1 stats in your screenshots, and don't show similar peaks 😞 - could you show me how to modify the display to get a view like yours?!). The graphs you posted show very heavy GC during the incident, and this probably accounts for the 100% CPU utilisation - so it's not exactly the JSON parsing that's consuming the CPU, it's the garbage collection of memory. That could conceivably be caused by some other process in the JVM creating and releasing lots of memory, but it's almost certainly the memory consumption around the JSON parsing process itself - it's loading big JSON files (for instance, the /uk front is 5.9MB uncompressed). The question: why so much garbage collection? Although 6MB per front is big, we're running facia on … The relevant pressing code is in frontend/common/app/services/fronts/FrontJsonFapi.scala, lines 52 to 60 at commit bc50652.
A possible failure mode that leads to excessive GC
It's worth noting that there was a …
Timeline from logs and CloudWatch
Throughput
If …
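To make the memory-allocation side of this concrete, here is a hedged sketch of the pattern being described above - reading a large pressed front fully into memory and then parsing it into a JSON tree. This is not the code in FrontJsonFapi.scala (which isn't reproduced here); the S3 fetch is a stub and the names are assumptions:

```scala
import scala.io.Source
import play.api.libs.json.{JsValue, Json}

object FrontParseCost {
  // Hypothetical stand-in for fetching a pressed front from S3.
  def fetchFrontBody(path: String): java.io.InputStream = ???

  // Each call materialises the ~6MB body as a String and again as a JsValue
  // tree, so every press allocates (and soon discards) several copies of the
  // data - this allocation churn is what the GC graphs above are showing.
  def parseFront(path: String): JsValue = {
    val body: String = Source.fromInputStream(fetchFrontBody(path), "UTF-8").mkString
    Json.parse(body)
  }
}
```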
I saw this in the Google SRE book and thought it'd be interesting to share in case it's helpful, since Ash suggested the real issue is to do with memory allocation.
Co-authored-by: Roberto Tyley [email protected]
Co-authored-by: Ioanna Kokkini [email protected]
Co-authored-by: Ravi [email protected]
What does this change?
We believe that the limit was too low. There are 2 instances of this class sharing 128 threads, so it can probably be higher.
This change is part of a solution to #26335.
See #18331 for where the limit originated.
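For context, here is a minimal sketch of the shape of a limit like this - a pressing pool whose size is the number being raised from 32 to 64. The names and structure are illustrative assumptions, not the actual diff:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object PressingPool {
  // Raised from 32: two instances of the pressing class share a 128-thread
  // pool, so 64 concurrent presses still leaves half the pool for other work.
  private val parallelJsonPresses = 64

  private val pressingContext: ExecutionContext =
    ExecutionContext.fromExecutorService(
      Executors.newFixedThreadPool(parallelJsonPresses))

  // Hypothetical entry point: each press is queued onto the bounded pool,
  // so at most `parallelJsonPresses` JSON documents are parsed concurrently.
  def press[A](parse: => A): Future[A] = Future(parse)(pressingContext)
}
```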