
Increase parallel json presses from 32 to 64 #26336

Closed

Conversation

georgeblahblah
Contributor

Co-authored-by: Roberto Tyley [email protected]
Co-authored-by: Ioanna Kokkini [email protected]
Co-authored-by: Ravi [email protected]

What does this change?

  • Increase parallel json presses from 32 to 64
    We believe that the limit was too low. There are 2 instances of this class sharing 128 threads, so it can probably be higher.

This change is part of a solution to #26335.

See #18331 for where the limit originates from.
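
The change itself is a single constant. As a rough sketch of what that looks like - assuming the limit is the constructor argument of the futureSemaphore discussed below (names illustrative; the real file may differ):

// Sketch only: assumes a FutureSemaphore-style class whose constructor argument is the concurrency limit.
// before
val futureSemaphore = new FutureSemaphore(32)
// after
val futureSemaphore = new FutureSemaphore(64)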

georgeblahblah requested a review from a team as a code owner July 20, 2023 13:58
@AshCorr
Member

AshCorr commented Jul 20, 2023

Are we sure parallelisation is the bottleneck? Looking at our JSON parse times, the time for a single thread to parse a single front goes up significantly. I feel like this would imply that the cores on this machine are already working at their max?

FYI these nodes only have 8 cores, which would mean they'd have at most 16 hardware threads?


https://metrics.gutools.co.uk/d/jkMv3tS4k/dotcom-dashboard?orgId=1&from=1689247102364&to=1689250345570&viewPanel=15 (data from the incident)

@AshCorr
Member

AshCorr commented Jul 20, 2023

Created a dashboard for this incident https://metrics.gutools.co.uk/dashboard/snapshot/Fv5nt5BEOtENzRlL5ypErgHy4IzlQ6Bw


CPU Utilisation was at 100% for the duration. I think if anything we should probably LOWER how many pages can be processed at once by a single facia instance; this would at least allow Facia to render some pages and fill the cache when an attack happens, instead of smashing the CPU and not being able to render any pages at all.

@rtyley
Member

rtyley commented Jul 20, 2023

CPU Utilisation was at 100% for the duration.

This is really interesting - I'm kind of exhausted for today, but I don't suppose you have access to any JVM garbage collection statistics you could overlay on that dashboard?!

@AshCorr
Member

AshCorr commented Jul 20, 2023

This is really interesting - I'm kind of exhausted for today, but I don't suppose you have access to any JVM garbage collection statistics you could overlay on that dashboard?!

We do indeed!


During the Incident we were garbage collecting ~300 times per minute and spending almost 24 seconds per minute doing it!

These metrics are our own custom metrics, not CloudWatch's, so I'm not super confident about how accurate they are. But spending 24 seconds per minute just garbage collecting feels like an awful lot.
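
(If it helps to sanity-check those numbers: a minimal sketch of reading cumulative GC counts and times straight from the JVM via the standard java.lang.management API - illustrative only, and not necessarily how our custom metrics are gathered:)

import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Cumulative GC count and time (in milliseconds) per collector, as reported by the JVM itself.
// Sampling this once a minute and diffing gives collections/minute and GC seconds/minute.
def gcSnapshot(): Map[String, (Long, Long)] =
  ManagementFactory.getGarbageCollectorMXBeans.asScala.map { bean =>
    bean.getName -> (bean.getCollectionCount, bean.getCollectionTime)
  }.toMap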

@rtyley
Member

rtyley commented Jul 21, 2023

During the Incident we were garbage collecting ~300 times per minute and spending almost 24 seconds per minute doing it!

This is absolutely fascinating, and really helps unravel the problem - thank you for that, and for adding the graphs to your dedicated incident dashboard (though unfortunately, the graphs show up very differently for me from the screenshots you posted - they appear to be showing an 'A-series' metric rather than the G1 stats in your screenshots, and don't show similar peaks 😞 - could you show me how to modify the display to get a view similar to yours?!).

The graphs you posted show that there is very heavy GC during the incident, and this probably accounts for the 100% CPU utilisation - so it's not exactly the JSON parsing that's consuming the CPU, it's the garbage collection of memory. That could conceivably be caused by some other process in the JVM creating & releasing lots of memory, but it's almost certainly the memory consumption around the JSON parsing process itself - it's loading big JSON files (for instance, the /uk front is 5.9MB uncompressed).

The question: why so much garbage collection?

Although 6MB per front is big, we're running facia on c6g.2xlarge instances that have 16GB of memory each (the JVM process runs with -Xmx12033m) - more than 2000x the size of a front - so why are we having to do so much garbage collection, especially given that futureSemaphore is set to a limit of only 32 concurrent operations?
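
(That "2000x" is just the JVM heap divided by the size of a big front:)

val heapMb  = 12033.0           // from -Xmx12033m
val frontMb = 5.9               // the uncompressed /uk front
val ratio   = heapMb / frontMb  // ≈ 2039, i.e. "more than 2000x"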

If we look at the pressedPageFromS3() code again, bear in mind that loadPressedPageFromS3() creates that big 6MB String - and then parsePressedPage() has that 32-concurrent-processes limit with futureSemaphore:

private def pressedPageFromS3(
    path: String,
)(implicit executionContext: ExecutionContext): Future[Option[PressedPage]] =
  errorLoggingF(s"FrontJsonFapi.pressedPageFromS3 $path") {
    for {
      s3FrontData <- loadPressedPageFromS3(path)
      pressedPage <- parsePressedPage(s3FrontData)
    } yield pressedPage
  }
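
(For anyone who hasn't met futureSemaphore before: a minimal sketch of a fail-fast semaphore for Futures, assuming it behaves roughly as described in this thread - reject immediately with TooManyOperationsInProgress once the limit is hit - rather than the real implementation:)

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

// Sketch only: allows at most `maxOperations` Futures in flight; anything beyond
// that is rejected straight away instead of being queued.
class FutureSemaphoreSketch(maxOperations: Int) {
  private val inFlight = new AtomicInteger(0)

  def execute[A](task: => Future[A])(implicit ec: ExecutionContext): Future[A] =
    if (inFlight.incrementAndGet() <= maxOperations) {
      val result = task
      result.onComplete(_ => inFlight.decrementAndGet())
      result
    } else {
      inFlight.decrementAndGet()
      Future.failed(new RuntimeException("TooManyOperationsInProgress"))
    }
}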

A possible failure mode that leads to excessive GC

  • A Facia app-server receives N fronts requests near-simultaneously, where N > 32, and possibly approaching 128.
  • Each request runs loadPressedPageFromS3() (while holding a blockingOperations thread), loading a big S3 object into memory as a String - N large Strings in total (this ordering is annotated in the sketch after this list)
  • Each request then attempts to run parsePressedPage(), which uses futureSemaphore:
    • N - 32 of those requests fail with TooManyOperationsInProgress due to futureSemaphore throttling. The big String objects from these requests still need to be garbage collected - but no successful results have been generated and returned to the CDN. Those requests will come again! Optimistically, it takes ~2000 requests to completely fill 12GB of JVM heap with 6MB strings - we were receiving something like 2000 fronts requests per minute during the incident.
    • The 32 requests that do get allowed by futureSemaphore still need to acquire a thread from the blockingOperations thread pool - and there's a shortage of those, because they're still getting used by the many loadPressedPageFromS3() calls. Waiting to acquire a thread to run on takes time - we see FrontDecodingLatency increase. Throughput drops.
  • Many additional requests that arrive will fail with TooManyOperationsInProgress, but only after having loaded more big strings into memory.
  • As GC CPU demands start to increase, the parsing itself in parsePressedPage() will start to get slower as it competes for CPU with the GC process trying to clear up all the big strings. So the futureSemaphore is held even longer, not just by thread starvation, but also by CPU starvation. Throughput drops further.
  • With futureSemaphore held even longer, even more requests fail with TooManyOperationsInProgress
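
To make that ordering explicit, here is the same for-comprehension again, annotated with where the memory cost and the throttling sit (the annotations are illustrative; the code is as quoted above):

for {
  s3FrontData <- loadPressedPageFromS3(path)   // allocates the ~6MB String, while holding a blockingOperations thread
  pressedPage <- parsePressedPage(s3FrontData) // only now does futureSemaphore decide; a rejected request fails with
                                               // TooManyOperationsInProgress, but the big String above is already on
                                               // the heap and still has to be garbage collected
} yield pressedPage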

It's worth noting that there was a dotcom:all deploy at 12:29pm - does this involve a cache-invalidation on the CDN?
That might explain a burst in simultaneous requests?

Timeline from logs and CloudWatch

  • Both CPU utilisation and FrontDecodingLatency start to rise at 12:35 (note that sub 1-minute resolution isn't available)
  • Facia request logs show a sharp increase in fronts requests at 12:35:06, going from ~5 requests per second to 21 rps, with 41 rps by 12:35:27.
  • TooManyOperationsInProgress started at 12:35:29

Throughput

If parsePressedPage() normally takes 0.1s, and we can run 32 concurrently, we can do parsePressedPage() 320 times per second on a single app server (32 / 0.1s). If it gets slow, taking 1.9 seconds, we can only do it 16 times per second -- we were receiving something like 60 fronts requests per second during the incident.
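
Spelling that arithmetic out (throughput ≈ concurrency / per-parse latency, using the figures above):

val concurrency = 32
val healthyThroughput  = concurrency / 0.1  // 320 parses per second on a single app server
val degradedThroughput = concurrency / 1.9  // ≈ 16.8 parses per second - well short of ~60 fronts requests per second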

@georgeblahblah
Contributor Author

Thanks so much @AshCorr & @rtyley! I think we should close this for now and focus on other areas of the investigation, like #26338

@ParisaTork
Contributor

I saw this in the Google SRE book and thought it'd be interesting to share in case it's helpful, since Ash suggested the real issue is to do with memory allocation:

Dependencies among resources

Note that many of these resource exhaustion scenarios feed from one another—a service experiencing overload often has a host of secondary symptoms that can look like the root cause, making debugging difficult.

For example, imagine the following scenario:

1. A Java frontend has poorly tuned garbage collection (GC) parameters.
2. Under high (but expected) load, the frontend runs out of CPU due to GC.
3. CPU exhaustion slows down completion of requests.
4. The increased number of in-progress requests causes more RAM to be used to process the requests.
5. Memory pressure due to requests, in combination with a fixed memory allocation for the frontend process as a whole, leaves less RAM available for caching.
6. The reduced cache size means fewer entries in the cache, in addition to a lower hit rate.
7. The increase in cache misses means that more requests fall through to the backend for servicing.
8. The backend, in turn, runs out of CPU or threads.
9. Finally, the lack of CPU causes basic health checks to fail, starting a cascading failure.

In situations as complex as the preceding scenario, it’s unlikely that the causal chain will be fully diagnosed during an outage. It might be very hard to determine that the backend crash was caused by a decrease in the cache rate in the frontend, particularly if the frontend and backend components have different owners.
