Fix performance baseline breach #2799
It looks like there were a few random blips, but once pulumi/examples#1471 was merged to upgrade the examples from v5 to v6 of pulumi-aws on 8/31 at 7:33 AM PT, we spiked. Those spikes are all 9/1.
@justinvp I wonder if that is just a coincidence. We've been on an alpha version of AWS v6 in examples for much of August: https://github.com/pulumi/examples/commits/master/aws-js-s3-folder/package.json
v3.80.0 was also released on 8/31: https://github.com/pulumi/pulumi/releases/tag/v3.80.0
@EvanBoyle, the commit history is deceiving. Those commits were part of the long-open pulumi/examples#1471 PR, which wasn't squashed when merged. The PR started in early August but didn't get merged into master until 8/31.
Trying out
I've collected some trace data over several runs with a few different variations: pulumi-aws/issues/2799 latency investigation. A few observations:
I think I have a lead on where the slowness may be coming from: there's no subspan covering this time, but I suspect that it is where we are waiting on the provider plugin to start. Playing around with starting each plugin in my terminal, it does feel like there's a larger delay before printing the port number for v6. The difference is subtle (the traces indicate a difference of ~650ms), but since we restart the plugin multiple times during an update, it's probably enough to account for the ~3s increase in the overall update time. (We should probably add a tracing span to cover provider start times; if my hunch is correct, that trace could have saved me several hours of experimenting with different profiling runs.)
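For illustration, a minimal sketch of what such a span could look like with opentracing-go (which the CLI's tracing is built on, if I recall correctly). The wrapper name and call site here are hypothetical, not the engine's actual code:

```go
package plugin

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// launchProviderPlugin is a hypothetical wrapper; the real call site lives in
// the engine's plugin host. The point is simply to open a named span so the
// plugin-start wait shows up in trace output instead of as an unexplained gap.
func launchProviderPlugin(ctx context.Context, start func(context.Context) error) error {
	span, ctx := opentracing.StartSpanFromContext(ctx, "provider-plugin-start")
	defer span.Finish()
	return start(ctx)
}
```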
Further testing found that the v6 aws plugin does take ~500ms longer just to get to printing the version string. The flame graph shows a significant amount of time spent in code paths that @t0yv0 confirmed are new to v6, as v6 is the first version of the aws provider that uses the tfbridge tools for bridging Plugin Framework (PF) based tf providers. (v5 contained a number of patches to downgrade resources that upstream had moved to PF back to tf SDKv2.)

I think that gives us the answer to our initial question of what is causing the slow-down, but there are a number of different items to follow up on to try to improve the situation. A large portion of the extra time is spent unmarshalling the large (45MB) metadata blob.

We could also:

Regardless of the approach we take to reducing init time, we should also:
It sounds like the source of impact is in our TFPF implementation, which is "expected" in some sense.

@mjeffryes How does a 500ms delay in plugin loading become a 3-second jump on all the charts in the issue description? Are we loading the plugin multiple times?

45 MB sounds like a lot! I think that's comparable to the metadata in Azure Native, which is an entirely different thing, but that provider has so many more resources and resource properties... Do we see ways to reduce that? For instance, I see things like

What is your recommendation to proceed? Is there low-hanging fruit, or do we need to park it until the next iteration? Is it still a P1? Could you please open an issue for the next steps?
Glad you could take a look here. Wafv2 rewrites not being applied is a good catch! I think Matthew's list above is our low-hanging fruit (the first two items): get rid of renames, and try a faster format or even just a faster deserializer like dropping in https://github.com/json-iterator/go that we use in p/p.
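A rough sketch of the json-iterator drop-in (the file and function names here are made up; `ConfigCompatibleWithStandardLibrary` is jsoniter's documented stdlib-compatible mode):

```go
package metadata

import jsoniter "github.com/json-iterator/go"

// Shadow the stdlib name so existing json.Unmarshal call sites pick up
// jsoniter without further changes; this is the drop-in pattern from the
// json-iterator README.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// unmarshalBridgeMetadata is a hypothetical helper standing in for wherever
// the bridge currently deserializes its embedded metadata blob.
func unmarshalBridgeMetadata(raw []byte, out interface{}) error {
	return json.Unmarshal(raw, out)
}
```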
Referencing pulumi/pulumi-azure-native#1689 here as some related prior art. @mjeffryes @t0yv0 maybe we can reuse some of those ideas.

We are not done here until the perf benchmarks are passing again. I've renamed the issue to reflect that and am adding back the P1 label.
We temporarily adjusted the alerting baselines until this issue has been resolved.
Part of addressing this issue is adjusting these back to lower than the previous values, given we're now reusing plugin processes across steps (pulumi/pulumi#13987).
@mjeffryes @t0yv0 I spent some time hacking on a prototype for #2799 (comment) and wanted to share some findings. First, to recap, I was curious to see if there's any benefit to moving some of our runtime deserialization down to the build/compile step. The schema and metadata we're embedding into the binary don't change, so (in theory) we should be able to build the result of our computation directly into the binary instead of re-computing it at runtime. To set the baseline, benchmarking
My first experiment was to slightly simplify, but not eliminate, the serialization. In particular, I noticed we only have 3 top-level keys. Instead of embedding the entire blob, we can embed each key's raw JSON separately:

```go
var Data = tfbridge.Data{
	M: map[string]json.RawMessage{
		"auto-aliasing": json.RawMessage([]byte("{...}")), // taken from bridge-metadata.json
		"mux":           json.RawMessage([]byte("{...}")),
		"renamed":       json.RawMessage([]byte("{...}")),
	},
}
```

Initializing the provider with this eliminates the initial deserialization of the ~40MB blob, while subsequent operations still need to deserialize their smaller respective blobs. This is a relatively small/non-invasive change and cuts new provider time in ~half:
The next question was whether we could eliminate the deserialization altogether. The structs involved are large enough that we bump into golang/go#33437, but after some wrestling with the compiler it is possible to embed everything natively (that is, deserialize all of the metadata once at build time and write the result out as Go source).

The result is >200k lines of Go (~50MB), which takes several minutes to compile. Fortunately we can use build tags in a way that lets us continue to toggle between the embed approach and this one (otherwise it would be a nightmare to develop against!); a rough sketch of that toggle is below. This came in at about ~1/4 of the baseline:
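A sketch of the build-tag toggle (the file names and the `pregen` tag are placeholders I'm inventing here, not necessarily what the branch uses):

```go
// metadata_embed.go: default build; keeps the current behavior of embedding
// the JSON blob and deserializing it at startup.
//go:build !pregen

package metadata

import (
	_ "embed"
	"encoding/json"
)

//go:embed bridge-metadata.json
var rawMetadata []byte

// Load parses the embedded blob at runtime.
func Load() (map[string]json.RawMessage, error) {
	m := map[string]json.RawMessage{}
	err := json.Unmarshal(rawMetadata, &m)
	return m, err
}
```

```go
// metadata_pregen.go: built with -tags pregen; the map below would be emitted
// by the code generator, so no JSON parsing happens at startup.
//go:build pregen

package metadata

import "encoding/json"

// generatedMetadata stands in for the >200k lines of generated Go literals.
var generatedMetadata = map[string]json.RawMessage{
	// "auto-aliasing": json.RawMessage(`{...}`), // etc., generated
}

// Load returns the pre-generated data directly.
func Load() (map[string]json.RawMessage, error) {
	return generatedMetadata, nil
}
```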
From here the bottleneck then becomes

I have literally zero context on providers beyond the hacking I've done here, but I can push these branches if they would be useful!
@blampe Those are some pretty nice results! It's a bit strange that you get roughly the same numbers for Gen vs NoGen in all runs; I suspect it might be lying and always running in one mode or the other. But I think we'd definitely be interested in seeing your branch and trying to lock in those wins.
For sure, I could also (very likely) have borked something while I was hacking! I've pushed branches here and here. The "baseline codegen" commit is the first improvement (skipping the initial deserialization), and HEAD is the second, much crazier, codegen-everything approach. Run
With pulumi/pulumi-terraform-bridge#1524 we are now roughly back in line with the v5 performance (Metabase). We still need to complete the performance testing in the providers to avoid shipping a regression like this in the future, but we can probably lower the alert thresholds and close this issue now.
By all means, let's do that so we can close. I wasn't aware we'd improved here, so I was doing some work on this in the iteration.
The changes in 1544 are a little invasive and I'm still verifying that they're safe. BenchmarkRuntimeProvider is down to 50-60ms, and the runtime metadata is much smaller. AutoAliasing is no longer a major contributor in the profile.
In terms of delivering more performance to our users around provider startup, going forward the super annoying cost seems to be in CheckConfig/Configure. Even with Luke's fix #3044 I'm getting something like 1s each for CheckConfig and Configure, even in the basic cases. Setting this option as our perf-sensitive customers do:

`aws:skipCredentialsValidation true`

removes the cost of CheckConfig but not Configure. I'm not sure it's reasonable to spend 1s there; I suspect it's all about credentials and AWS client config. By comparison,
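For reference, a minimal sketch of setting that same flag programmatically on an explicit provider in a Go program (using the pulumi-aws v6 Go SDK; resource wiring omitted, and this only mirrors the stack-config setting above rather than changing where the time goes):

```go
package main

import (
	"github.com/pulumi/pulumi-aws/sdk/v6/go/aws"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Equivalent of `pulumi config set aws:skipCredentialsValidation true`,
		// but on an explicit provider instance; resources would opt in via the
		// pulumi.Provider(prov) resource option.
		prov, err := aws.NewProvider(ctx, "aws-fast-startup", &aws.ProviderArgs{
			SkipCredentialsValidation: pulumi.Bool(true),
		})
		if err != nil {
			return err
		}
		_ = prov
		return nil
	})
}
```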
Lowered the thresholds that we raised while investigating this issue:
Cannot close issue:
Please fix these problems and try again.
With pulumi/pulumi#13654, I bumped the `aws-ts-s3-folder` threshold from 4500 to 4800 after local investigation. Now we're getting alerted that `aws-js-s3-folder`, `aws-py-s3-folder`, and `aws-ts-s3-folder` are over their thresholds for this test run. Needs investigation: was it just a blip or a consistent issue?