Huge increase in deployment times for v6 upgrade #2987
Comments
Important notes:
@mjeffryes @t0yv0 Does one of you want to work with Lee on reproducing this issue?
@mikhailshilkov we have all the data in this Slack thread; we need eyes on this ASAP:
Summarizing the investigation so far:
Quick update here. It appears that CheckConfig and Configure are taking 10 and 20 minutes respectively, and the customer stack performs them twice (two instances of the AWS provider), leading to an unacceptable slowdown. @jaxxstorm contributed a workaround that is now released as v5.43.0 and reduces the customer's urgency to land this fix on v6. This buys a little time, but it remains a high priority for us to fix, since at some point the customer will be compelled to upgrade to v6 again.

In terms of the actual fix, we still do not have a solid repro, and we found we have an instrumentation gap: the traces do not explain why Configure/CheckConfig are slow, and the verbose logs are silent around the timestamp gaps. With pulumi/pulumi-terraform-bridge#1534 we're trying to add visibility into Configure/CheckConfig for both the SDKv2 and PF sub-providers. This change also adds memory instrumentation to get an idea of RAM statistics for the pulumi-resource-aws Go process. In addition, we are attempting to fix pulumi/pulumi-terraform-bridge#1536, which should recover Terraform-initiated logs under the TF_LOG=trace environment variable. With these two pieces of instrumentation we'll work through another round of customer testing to gather more information. Current hypotheses:
In pulumi/pulumi-aws#2987 a customer is experiencing very long Configure and CheckConfig times, but our traces and logs provide limited visibility into what's going on. This PR adds a little more information. The work is complicated by the fact that there's no root span for the provider, and the provider process can be killed by the engine at any moment, so finalizing a root span is not reliable. Therefore the strategy of collecting memory stats in a background thread and tagging them onto the root span, as in pulumi/pulumi, does not work very well. Changes:

- add spans for key SDKv2 methods: GetPluginInfo, Configure, CheckConfig, Diff, Create, Update, Delete, Invoke
- all the spans sample the Go memory stats for the running process and report them as tags on the span
- since we are investigating Configure and CheckConfig specifically, additional tags are added to show how much time is spent in PreConfigureCallbacks and in provider configuration validation
- the latter is also instrumented for the PF version of Configure

With these changes we have a little more visibility into PF vs. SDKv2 Configure costs and some memory stat samples to get an idea of peak memory use for the provider. This approach assumes that sampling memory at the beginning of each span is sufficient; it could perhaps be improved by sampling every 1s for long-running spans.
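To make the memory tagging concrete, here is a minimal sketch of the idea, not the actual bridge code from pulumi/pulumi-terraform-bridge#1534. It assumes an opentracing-go style span API, and the function names are illustrative: a one-shot sample taken when a span starts, plus the hypothetical 1-second periodic sampler mentioned above for long-running spans.

```go
package instrumentation

import (
	"fmt"
	"runtime"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
)

// tagMemStats takes a one-shot sample of the Go runtime memory statistics and
// records them as tags on the span; roughly what "sampling at the beginning of
// each span" would look like.
func tagMemStats(span opentracing.Span) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	span.SetTag("mem.heap_alloc_bytes", m.HeapAlloc)
	span.SetTag("mem.heap_sys_bytes", m.HeapSys)
	span.SetTag("mem.num_gc", m.NumGC)
}

// sampleMemPeriodically is the possible refinement for long-running spans:
// keep sampling every second until done is closed, indexing the tag keys so
// later samples do not overwrite earlier ones.
func sampleMemPeriodically(span opentracing.Span, done <-chan struct{}) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for i := 0; ; i++ {
		select {
		case <-done:
			return
		case <-ticker.C:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			span.SetTag(fmt.Sprintf("mem.sample.%d.heap_alloc_bytes", i), m.HeapAlloc)
		}
	}
}
```

In the bridge, spans like these would wrap the gRPC methods listed above (Configure, CheckConfig, Diff, and so on): start a span at the top of the method, sample memory, and finish the span when the method returns.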
We suspect this perf regression may have been introduced by #2949. We are working on a patch to change that implementation to avoid the additional synchronization it forces. If that is correct, there is still an underlying issue: some Configure call is taking 10+ minutes. Previously, however, this may not have been observable, because the Configure call that mattered (the one other resource operations depended on) was happening quickly. We will still need to collect more detailed logs from folks hitting this to root-cause where that slow Configure is coming from. But ideally, the fix to the logic in #2949 will unblock the practical perf regression a few folks are seeing here.
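For illustration only, here is a small self-contained sketch of the kind of synchronization suspected here; the types and names are made up and this is not the code from #2949. If every resource operation gates on a shared "configured" signal, a single slow Configure serializes the whole deployment.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type provider struct {
	configured chan struct{} // closed once Configure completes
	once       sync.Once
}

func (p *provider) Configure() {
	p.once.Do(func() {
		time.Sleep(2 * time.Second) // stand-in for a 10+ minute Configure
		close(p.configured)
	})
}

// Diff blocks until Configure has finished; with a slow Configure this shared
// gate becomes the bottleneck for every resource operation.
func (p *provider) Diff(name string) {
	<-p.configured
	fmt.Println("diffed", name)
}

func main() {
	p := &provider{configured: make(chan struct{})}
	go p.Configure()

	var wg sync.WaitGroup
	for _, r := range []string{"bucket", "role", "queue"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			p.Diff(r)
		}(r)
	}
	wg.Wait()
}
```

The patch direction described above is to avoid forcing every operation through such a shared gate, so that a slow Configure on one provider instance does not stall unrelated operations.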
v6.11.0-alpha.1 has been released with a fix for the logic in #2949 and additional instrumentation.
The investigation in the other issue has identified the likely root cause, ultimately due to aws/aws-sdk-go-v2#2353; see the notes in #3044 (comment).
We released https://github.com/pulumi/pulumi-aws/releases/tag/v6.12.0 this morning, which includes the fix from aws/aws-sdk-go-v2#2355. We've also verified that all versions of […]. So we're closing this out as fixed with the latest release. If you continue to see any performance issues, please do reopen!
What happened?
A customer is reporting a large increase in deployment times when upgrading the AWS provider for a stack of approximately 1k resources.
Example
I have a performance trace for the issue. Please DM me to see it.
Output of pulumi about
N/A
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).