
Huge increase in deployment times for v6 upgrade #2987

Closed
jaxxstorm opened this issue Nov 14, 2023 · 10 comments
Labels

  • impact/performance: Something is slower than expected
  • kind/bug: Some behavior is incorrect or out of spec
  • needs-repro: Needs repro steps before it can be triaged or fixed
  • p1: A bug severe enough to be the next item assigned to an engineer
  • resolution/fixed: This issue was fixed

Comments

@jaxxstorm
Contributor

What happened?

A customer is reporting a large increase in deployment times when upgrading the AWS provider for a stack of approximately 1k resources.

Example

I have a performance trace for the issue. Please DM me to see it

Output of pulumi about

N/A

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@jaxxstorm added the kind/bug and needs-triage labels Nov 14, 2023
@jaxxstorm
Contributor Author

Important notes:

up is run with PULUMI_EXPERIMENTAL=1 PULUMI_SKIP_CHECKPOINTS=true PULUMI_OPTIMIZED_CHECKPOINT_PATCH=true pulumi up --yes

@jaxxstorm
Contributor Author

After viewing the trace, it seems every operation is taking considerably longer than expected

[Screenshot: CleanShot 2023-11-14 at 13 44 38@2x, a trace view showing each operation taking considerably longer than expected]

@t0yv0 added the impact/performance label Nov 14, 2023
@mikhailshilkov
Member

@mjeffryes @t0yv0 Does one of you want to work with Lee on reproducing this issue?

@jaxxstorm
Contributor Author

@mikhailshilkov we have all the data in this Slack thread; we need eyes on this ASAP:
https://pulumi.slack.com/archives/C9SGX9QA1/p1699996996558019

@mjeffryes
Member

Summarizing the investigation so far:

  • The customer is able to reliably trigger slow updates and imports with the v6 plugin, and fast updates with the v5 plugin; they have confirmed that the environment and program for each command are identical.
  • Pulumi has not yet successfully reproduced the slowdown.
  • We suspect that the problem might be in the state recorded by v6:
    • The slow traces provided by the customer show many unusually long spans that don't directly depend on the resource provider (e.g. the language plugin taking 100x longer to start)
    • The customer has observed that even imports with v5 are slow after running an upgrade with v6 (and vice versa, the initial import with v6 is not slow when the last upgrade was made with v5)

@mikhailshilkov added the needs-repro and p1 labels and removed the needs-triage label Nov 17, 2023
@t0yv0 assigned t0yv0 and unassigned mjeffryes Nov 17, 2023
@t0yv0
Member

t0yv0 commented Nov 21, 2023

Quick update here. It appears that CheckConfig/Configure are taking 10 and 20 minutes respectively, and the customer stack performs them twice (two instances of AWS provider), leading to an unacceptable slowdown.

@jaxxstorm contributed a workaround that is now released as v5.43.0 and reduces the urgency for the customer to land this fix on v6. This buys a little time, but it remains a high priority for us to fix, since at some point the customer will again be compelled to upgrade to v6.

In terms of the actual fix, we still do not have a solid repro, and we have found an instrumentation gap: the traces do not explain why Configure/CheckConfig are slow, and the verbose logs are silent around the timestamp gaps.

With pulumi/pulumi-terraform-bridge#1534 we're trying to add visibility into Configure/CheckConfig for both the SDKv2 and PF sub-providers. This change also adds memory instrumentation to get an idea of RAM usage for the pulumi-resource-aws Go process.

In addition, we are attempting to fix pulumi/pulumi-terraform-bridge#1536, which should recover Terraform-initiated logs under the TF_LOG=trace environment variable.

With these two pieces of instrumentation we'll work through another round of customer testing to gather more information.

Current hypotheses:

  • increased RAM consumption on v6

  • pathologically slow Configure or CheckConfig on the PF sub-provider, similar to Performance degradation with custom endpoints #2880; that issue is fixed in the version the customer is trying, but it does indicate this is a problematic area.

t0yv0 added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Nov 27, 2023
In pulumi/pulumi-aws#2987 a customer is
experiencing very long Configure and CheckConfig times but our traces
and logs provide limited visibility into what's going on.

This PR adds a little more information. This is complicated by the fact
that there's no root span for the provider and the provider process can
be killed by the engine at any moment, so finalizing the root span is
not reliable. Therefore the strategy of collecting mem stats in a
background thread and tagging them to the root span, as in pulumi/pulumi,
does not work very well.

Changes:

- add spans for key SDKv2 methods: GetPluginInfo, Configure, CheckConfig,
Diff, Create, Update, Delete, Invoke
- all the spans sample the go memory stats for the running process and
report them as tags on the span
- since we are investigating Configure and CheckConfig specifically,
additional tags are added to show how much time is spent in
PreConfigureCallbacks and in provider configuration validation
- the latter is also instrumented for the PF version of Configure

With these changes we have a little more visibility into PF vs SDKv2
Configure costs and have some memory stat samples to get an idea of peak
memory use for the provider. This approach assumes that sampling memory at
the beginning of each span is sufficient; it could perhaps be improved by
sampling every 1s for long-running spans.
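
For readers following along, here is a minimal sketch of the instrumentation pattern the commit describes: open a span per provider gRPC method and record a one-shot sample of Go memory statistics as span tags. The helper name and tag keys below are illustrative assumptions, not the bridge's actual identifiers.

```go
package tracing

import (
	"context"
	"runtime"

	opentracing "github.com/opentracing/opentracing-go"
)

// withInstrumentedSpan is a hypothetical helper: it wraps a provider gRPC
// method (Configure, CheckConfig, Diff, ...) in a tracing span and samples
// the Go runtime memory statistics once, at span start, attaching them as
// tags so the trace viewer can correlate latency with memory pressure.
func withInstrumentedSpan(ctx context.Context, method string, f func(context.Context) error) error {
	span, ctx := opentracing.StartSpanFromContext(ctx, method)
	defer span.Finish()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	span.SetTag("mem.heap_alloc_bytes", m.HeapAlloc)
	span.SetTag("mem.sys_bytes", m.Sys)
	span.SetTag("mem.num_gc", m.NumGC)

	return f(ctx)
}
```

Because the sample is taken only at span start, a long-running Configure could still peak well above the recorded values; as the commit message notes, sampling periodically inside long spans would tighten that estimate.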
@lukehoban
Member

We suspect this perf regression may have been introduced by #2949. We are working on a patch to change that implementation to avoid the additional synchronization it forces.

If that is correct, there is still an underlying issue: some Configure call is taking 10+ minutes. However, previously this may not have been observable, since the Configure call that mattered (the one other resource operations depended on) was happening quickly. We will still need to collect more detailed logs from folks hitting this to root-cause where that slow Configure is coming from.

But ideally, the fix to the logic in #2949 will unblock the practical perf regression a few folks are seeing here.
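
To make the synchronization concern concrete, here is a rough sketch, under our own assumptions rather than the provider's actual code, of how gating every gRPC method on Configure having completed turns one slow Configure into a stack-wide stall:

```go
package provider

import (
	"context"
	"sync"
)

// gatedProvider illustrates the hazard: every method waits for Configure to
// finish before doing any work, so a single 10+ minute Configure delays every
// resource operation queued behind it.
type gatedProvider struct {
	configured chan struct{} // closed once Configure completes
	once       sync.Once
}

func newGatedProvider() *gatedProvider {
	return &gatedProvider{configured: make(chan struct{})}
}

func (p *gatedProvider) Configure(ctx context.Context) error {
	// ... potentially slow work: credential resolution, shared config parsing ...
	p.once.Do(func() { close(p.configured) })
	return nil
}

func (p *gatedProvider) Diff(ctx context.Context) error {
	// Blocks until Configure has finished, even if this particular diff
	// never needed the configured state.
	select {
	case <-p.configured:
	case <-ctx.Done():
		return ctx.Err()
	}
	// ... actual diff logic ...
	return nil
}
```

The patch aims to avoid that extra synchronization so operations that do not depend on the configured state can proceed, which is why it should unblock the practical regression even though the slow Configure itself still needs a root cause.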

@t0yv0
Member

t0yv0 commented Nov 28, 2023

v6.11.0-alpha.1 is released with a fix for the logic in #2949 and additional instrumentation.

@lukehoban
Member

lukehoban commented Nov 29, 2023

The investigation in the other issue has identified the expected root cause - ultimately due to aws/aws-sdk-go-v2#2353 - notes in #3044 (comment).
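
Since the suspected root cause sits in the SDK's shared config/INI handling (see the module references in the following comment), one way to check whether it reproduces outside Pulumi is to time aws-sdk-go-v2's shared configuration loading directly against the same AWS config and credentials files. This is a hedged, standalone sketch; it assumes the slowdown is visible in config.LoadDefaultConfig on the affected machine.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	// Time shared configuration loading (profiles, credentials, INI parsing)
	// in isolation, using the same AWS_CONFIG_FILE / AWS_SHARED_CREDENTIALS_FILE
	// environment as the slow Pulumi deployment.
	start := time.Now()
	if _, err := config.LoadDefaultConfig(context.Background()); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("LoadDefaultConfig took %s\n", time.Since(start))
}
```

If this call alone takes minutes with the affected files, that points at the SDK regression rather than anything in the engine or the bridge.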

@lukehoban added the resolution/fixed label Nov 29, 2023
@lukehoban
Member

We released https://github.com/pulumi/pulumi-aws/releases/tag/v6.12.0 this morning, which includes the fix from aws/aws-sdk-go-v2#2355. We've also verified that all versions of http://github.com/aws/aws-sdk-go-v2/config and github.com/aws/aws-sdk-go-v2/internal/ini match upstream.

So we'll close this out as fixed with the latest release. If you continue to see any performance issues, please do reopen!
