
Fix performance baseline breach #2799

Closed
justinvp opened this issue Aug 25, 2023 · 38 comments

justinvp (Member) commented Aug 25, 2023

[Screenshot: Screen Shot 2023-08-25 at 7 40 07 AM]

With pulumi/pulumi#13654, I bumped the aws-ts-s3-folder threshold from 4500 to 4800 after local investigation.

Now we're getting alerted that aws-js-s3-folder, aws-py-s3-folder, and aws-ts-s3-folder are over their thresholds for this test run. Needs investigation. Was it just a blip or a consistent issue?

@justinvp justinvp added the kind/bug, area/cli, impact/performance, and p1 labels Aug 25, 2023
@justinvp justinvp added this to the 0.93 milestone Aug 25, 2023
@justinvp justinvp self-assigned this Aug 25, 2023
@justinvp justinvp modified the milestones: 0.93, 0.94 Sep 6, 2023
justinvp (Member, Author) commented Sep 9, 2023

It looks like there were a few random blips, but once pulumi/examples#1471 was merged to upgrade the examples from v5 to v6 of pulumi-aws on 8/31 at 7:33 AM PT, we spiked. Those spikes are all 9/1.

[Screenshots: Screen Shot 2023-09-08 at 9 51 20 PM, 9 51 29 PM, 9 51 38 PM, 9 51 51 PM]

EvanBoyle (Contributor) commented

@justinvp I wonder if that is just a coincidence. We've been on an alpha version of AWS v6 in examples for much of August: https://github.com/pulumi/examples/commits/master/aws-js-s3-folder/package.json

EvanBoyle (Contributor) commented

v3.80.0 was also released on 8/31: https://github.com/pulumi/pulumi/releases/tag/v3.80.0

justinvp (Member, Author) commented Sep 9, 2023

We've been on an alpha version of AWS v6 in examples for much of August:

@EvanBoyle, the commit history is deceiving. Those commits were part of the long open pulumi/examples#1471 PR, which wasn't squashed when merged. The PR started in early August, but didn't get merged into master until 8/31.

justinvp (Member, Author) commented Sep 9, 2023

Trying out aws-js-s3-folder locally with the same version of the CLI (dev build of 3.81.1), I'm seeing a consistent difference (a couple seconds) between running an empty update with v5 vs v6.
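
A minimal way to time that comparison locally (assuming the stack is already deployed, so the update is a no-op; the flags shown are standard Pulumi CLI usage, not anything specific to this issue):

# time a no-op update against an already-deployed stack
time pulumi up --yes --skip-preview

# preview-only path, for comparison
time pulumi preview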

@justinvp justinvp removed the area/cli label Sep 11, 2023
@justinvp justinvp removed their assignment Sep 11, 2023
@justinvp justinvp added the needs-triage label Sep 11, 2023
@justinvp justinvp removed this from the 0.94 milestone Sep 11, 2023
@justinvp justinvp changed the title from "CLI performance metric breach" to "Investigate perf regression in pulumi-aws v6" Sep 11, 2023
@justinvp justinvp transferred this issue from pulumi/pulumi Sep 11, 2023
@mikhailshilkov mikhailshilkov added and removed the p1 label Sep 11, 2023
@mikhailshilkov mikhailshilkov removed the needs-triage label Sep 11, 2023
@mikhailshilkov mikhailshilkov added this to the 0.94 milestone Sep 11, 2023
mjeffryes (Member) commented

I've collected some trace data over several runs with a few different variations: pulumi-aws/issues/2799 latency investigation

A few observations:

  1. The latency increase is present in an empty update.
  2. There is a smaller latency increase for both preview-only and update-no-preview commands.
  3. The latency increase is tied to the plugin version used and is not affected by the SDK version.
  4. The latency delta between versions does not seem to be influenced by how many resources you are creating.
  5. Most confounding to me so far is that the traces don't show a significant increase in latency for Check or Diff (or any other provider method), but the parent span(s) for RegisterResource do show the latency.

mjeffryes (Member) commented Sep 15, 2023

I think I have a lead on where the slowness may be coming from:
There's a significant change in the spacing in /pulumirpc.ResourceMonitor/RegisterResource on either side of the /pulumirpc.ResourceProvider/CheckConfig trace for v6:

[Trace screenshots: v6, v5]

There's no subspan covering this time, but I suspect that it is where we are waiting on the provider plugin to start. Playing around with starting each plugin in my terminal, it does feel like there's a larger delay before printing the port number for v6. The difference is subtle (the traces indicate a difference of ~650ms), but since we restart the plugin multiple times during an update, it's probably enough to account for the ~3s increase in the overall update time.

(We should probably add a tracing span to cover provider start times; if my hunch is correct, that trace could have saved me several hours of experimenting with different profiling runs.)
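
As a rough illustration of that idea (this is not the actual pulumi/pulumi plugin host code; the helper, span name, and simulated delay below are made up):

package plugin

import (
    "context"
    "time"

    "github.com/opentracing/opentracing-go"
)

// startProviderWithSpan is a hypothetical wrapper that records how long we
// block waiting for a provider plugin to start and report its port.
func startProviderWithSpan(ctx context.Context, pkg, version string) (int, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "pulumi:provider-start")
    span.SetTag("package", pkg)
    span.SetTag("version", version)
    defer span.Finish()
    return launchPlugin(ctx, pkg, version)
}

// launchPlugin stands in for the real plugin launch, which blocks until the
// plugin prints the port it is serving on; here it just simulates the delay.
func launchPlugin(ctx context.Context, pkg, version string) (int, error) {
    time.Sleep(650 * time.Millisecond)
    return 54321, nil
}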

mjeffryes (Member) commented Sep 15, 2023

Further testing found that the v6 aws plugin does take ~500ms longer just to print the version string. The flame graph shows a significant amount of time spent in tfbridge.NewProviderMetadata and tfbridge.MustApplyAutoAliases:

[Flamegraph]

@t0yv0 confirmed that these are new to v6, as v6 is the first version of the aws provider that uses the tfbridge tools for bridging Plugin Framework (PF) based Terraform providers. (v5 contained a number of patches to downgrade resources that the upstream had moved to PF back to the Terraform SDKv2.)

I think that gives us the answer to our initial question of what is causing the slow-down, but there are a number of different items to follow up on to try to improve the situation:

A large portion of the extra time is spent unmarshalling the large (45MB) bridge-metadata.json file:

We could also:

  • Take a lazy approach to loading bridge metadata so that we don't eat the whole cost of loading the file on init. To avoid simply shifting the cost around, we might need a storage format that facilitates random access.
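
A minimal sketch of what that lazy approach could look like, assuming the metadata keeps its three top-level keys; the type and function names are illustrative, not the real tfbridge API:

package metadata

import (
    "encoding/json"
    "sync"
)

// lazyMetadata defers splitting the embedded blob into its top-level keys
// until the first time one of them is needed. Note this only moves the cost
// to first use; avoiding it entirely would need a random-access storage
// format, as the bullet above points out.
type lazyMetadata struct {
    raw  []byte // embedded bridge-metadata.json bytes
    once sync.Once
    keys map[string]json.RawMessage
    err  error
}

func newLazyMetadata(raw []byte) *lazyMetadata {
    return &lazyMetadata{raw: raw}
}

// Section returns the raw JSON for one top-level key ("auto-aliasing",
// "mux", or "renamed"), decoding the outer object only on first access.
func (m *lazyMetadata) Section(key string) (json.RawMessage, error) {
    m.once.Do(func() {
        m.err = json.Unmarshal(m.raw, &m.keys)
    })
    if m.err != nil {
        return nil, m.err
    }
    return m.keys[key], nil
}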

Regardless of the approach we take to reducing init time, we should also:

mikhailshilkov (Member) commented

It sounds like the source of impact is in our TFPF implementation, which is "expected" in some sense.

@mjeffryes How does a 500ms delay in plugin loading become a 3-second jump on all the charts in the issue description? Are we loading the plugin multiple times?

45 MB sounds like a lot! I think that's comparable to the metadata in Azure Native, which is an entirely different thing, but that provider has so many more resources and resource properties... Do we see ways to reduce that? For instance, I see things like RuleGroupRuleStatementAndStatementStatementNotStatementStatementRegexPatternSetReferenceStatementFieldToMatchHeader in it. Does it mean our WafV2 rewrites aren't working for metadata?

What is your recommendation to proceed? Is there a low-hanging fruit, or do we need to park it until the next iteration? Is it still a P1?

Could you please open an issue for the next steps?

t0yv0 (Member) commented Sep 18, 2023

Glad you could take a look here. WafV2 rewrites not being applied is a good catch! I think Matthew's list above is our low-hanging fruit (the first two items): get rid of renames and try a faster format, or even just a faster deserializer, like dropping in https://github.com/json-iterator/go, which we use in pulumi/pulumi.
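
A minimal sketch of that json-iterator drop-in (the surrounding names are illustrative, not the bridge's actual code):

package metadata

import (
    jsoniter "github.com/json-iterator/go"
)

// json is a drop-in replacement for encoding/json configured to match the
// standard library's behavior; existing call sites keep using json.Unmarshal
// unchanged, just with a faster decoder underneath.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// decodeBridgeMetadata splits the embedded blob into its top-level sections.
func decodeBridgeMetadata(raw []byte) (map[string]jsoniter.RawMessage, error) {
    var m map[string]jsoniter.RawMessage
    if err := json.Unmarshal(raw, &m); err != nil {
        return nil, err
    }
    return m, nil
}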

mikhailshilkov (Member) commented

Referencing pulumi/pulumi-azure-native#1689 here as some related prior art. @mjeffryes @t0yv0 maybe we can reuse some of those ideas.

@EvanBoyle EvanBoyle reopened this Sep 29, 2023
@EvanBoyle EvanBoyle changed the title from "Investigate perf regression in pulumi-aws v6" to "Fix performance baseline breach" Sep 29, 2023
EvanBoyle (Contributor) commented

We are not done here until the perf benchmarks are passing again. I've renamed the issue to reflect that and am adding back the P1 label.

@EvanBoyle EvanBoyle added the p1 label and removed the resolution/fixed label Sep 29, 2023
@mikhailshilkov mikhailshilkov modified the milestones: 0.94, 0.95 Oct 2, 2023
justinvp (Member, Author) commented Oct 17, 2023

We temporarily adjusted the alerting baselines until this issue has been resolved.

'pulumi-update-empty', 'aws-py-s3-folder', 4800 => 5800
'pulumi-update-empty', 'aws-js-s3-folder', 4500 => 5800
'pulumi-update-empty', 'aws-ts-s3-folder', 4800 => 5800

Part of addressing this issue is adjusting these back to lower than the previous values, given we're now reusing plugin processes across steps (pulumi/pulumi#13987).

@mikhailshilkov mikhailshilkov modified the milestones: 0.95, 0.96 Oct 26, 2023
blampe commented Nov 8, 2023

@mjeffryes @t0yv0 I spent some time hacking on a prototype for #2799 (comment) and wanted to share some findings.

First, to recap, I was curious to see if there's any benefit to moving some of our runtime deserialization down to the build/compile step. The schema and metadata we're embedding into the binary don't change, so (in theory) we should be able to build the result of our computation directly into the binary instead of re-computing it at runtime.

To set the baseline, benchmarking Provider() (with isTfGen true/false) on master gave me:

BenchmarkProviderGen-10    	       5	 228882600 ns/op	423606320 B/op	 2848954 allocs/op
BenchmarkProviderNoGen-10    	       5	 203570350 ns/op	347397752 B/op	 2308434 allocs/op

My first experiment was to slightly simplify, but not eliminate, the deserialization. In particular, I noticed we only have three top-level keys. Instead of embedding the entire bridge-metadata.json struct, we can generate a hack.go file that pulls out those fields while still keeping the raw JSON around.

// Generated into hack.go: the three top-level sections of bridge-metadata.json,
// kept as raw JSON so only the section that is needed gets decoded later.
var Data = tfbridge.Data{
    M: map[string]json.RawMessage{
        "auto-aliasing": json.RawMessage([]byte("{...}")), // taken from bridge-metadata.json
        "mux":           json.RawMessage([]byte("{...}")),
        "renamed":       json.RawMessage([]byte("{...}")),
    },
}

Initializing the provider with this eliminates the initial deserialization of the ~40MB blob, while subsequent operations still need to deserialize their smaller respective blobs. This is a relatively small/non-invasive change and cuts new provider time in ~half:

BenchmarkProviderGen-10    	       9	 113264319 ns/op	303111697 B/op	 1918905 allocs/op
BenchmarkProviderNoGen-10    	      10	 102601083 ns/op	277265940 B/op	 1615543 allocs/op

The next question was whether we could eliminate the deserialization altogether. The structs involved are large enough that we bump into golang/go#33437 but after some wrestling with the compiler it is possible to embed everything natively. (That is, deserialize all of the metadata, run MustApplyAutoAliases, and persist the result in code.)

The result is >200k lines of Go (~50MB) which takes several minutes to compile. Fortunately we can use build tags in a way that lets us continue to toggle between the embed approach and this one (otherwise it would be a nightmare to develop against!). This came in at about ~1/4 of the baseline:

BenchmarkProviderGen-10    	      20	  51311821 ns/op	156296561 B/op	  883827 allocs/op
BenchmarkProviderNoGen-10    	      20	  52889356 ns/op	156296712 B/op	  883832 allocs/op
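
A rough sketch of how such a build-tag toggle can be wired up (file names, package layout, and the Load function are hypothetical, not the actual branch):

// file: metadata_embed.go — default path, decodes the embedded JSON at runtime.

//go:build !hack

package metadata

import (
    _ "embed"
    "encoding/json"
)

//go:embed bridge-metadata.json
var rawMetadata []byte

// Load pays the full deserialization cost every time the provider starts.
func Load() (map[string]json.RawMessage, error) {
    var m map[string]json.RawMessage
    err := json.Unmarshal(rawMetadata, &m)
    return m, err
}

// file: metadata_hack.go — generated path, selected with `go build -tags hack`.

//go:build hack

package metadata

import "encoding/json"

// generatedMetadata stands in for the huge literal emitted by the codegen
// step; with the hack tag nothing is decoded at runtime.
var generatedMetadata = map[string]json.RawMessage{
    "auto-aliasing": json.RawMessage(`{}`),
    "mux":           json.RawMessage(`{}`),
    "renamed":       json.RawMessage(`{}`),
}

// Load just returns the pre-built data.
func Load() (map[string]json.RawMessage, error) {
    return generatedMetadata, nil
}

With that layout, a plain go build keeps the embedded-JSON path, while go build -tags hack compiles in the generated data instead.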

From here the bottleneck then becomes NewUpstreamProvider, which seems good? I'm not sure if we could do a similar trick with that one.

I have literally zero context on providers beyond the hacking I've done here, but I can push these branches if they would be useful!

mjeffryes (Member) commented

@blampe Those are some pretty nice results! It's a bit strange that you get roughly the same numbers for Gen vs. NoGen in all runs. I suspect it might be lying and always running in one mode or the other. But I think we'd definitely be interested in seeing your branch and trying to lock in those wins.

blampe commented Nov 13, 2023

It's a bit strange that you get roughly the same numbers for Gen vs. NoGen in all runs. I suspect it might be lying and always running in one mode or the other.

For sure, I could have also (very likely) borked something while I was hacking!

I've pushed branches here and here.

The "baseline codegen" commit is the first improvement (skipping the initial deserialization), and HEAD is the second (much crazier) codegen everything approach.

Run go run cmd/codegen/main.go to generate the hack.go file. To build the full codegen approach you'll need to add a build tag: go build -tags hack ./... -- that one took upwards of 10 minutes for me locally.

mjeffryes (Member) commented Nov 30, 2023

With pulumi/pulumi-terraform-bridge#1524 we are now roughly back in line with the v5 performance (see Metabase).

We still need to complete the performance testing in the providers to avoid shipping a regression like this in the future, but we can probably lower the alert thresholds and close this issue now.

t0yv0 (Member) commented Nov 30, 2023

By all means, let's do that so we can close.

I wasn't aware we'd improved here, so I was doing some work on this in the iteration.

The changes in 1544 are a little invasive and I'm still verifying if they're safe.

BenchmarkRuntimeProvider is down to 50-60ms.

The runtime metadata is much smaller.

du -sh provider/cmd/pulumi-resource-aws/runtime-bridge-metadata.json
 88K    provider/cmd/pulumi-resource-aws/runtime-bridge-metadata.json

AutoAliasing is not a major contributor in the profile anymore.

t0yv0 (Member) commented Nov 30, 2023

In terms of delivering more performance to our users around provider startup, going forward the super annoying cost seems to be in CheckConfig/Configure. Even with Luke's fix #3044 I'm getting something like 1s each for CheckConfig and Configure even in the basic cases. Setting this option as our perf-sensitive customers do:

aws:skipCredentialsValidation true

removes the cost of CheckConfig but not Configure.
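
For reference, that option is ordinary Pulumi stack configuration, which can be set with:

pulumi config set aws:skipCredentialsValidation true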

I'm not sure it's reasonable to spend 1s there; I suspect it's all about credentials and AWS client config. By comparison, aws s3 ls takes 600ms not only to get authenticated but also to retrieve the buckets. We should be able to do better there.

mjeffryes (Member) commented

Lowered the thresholds that we raised while investigating this issue:

'pulumi-update-empty', 'aws-py-s3-folder', 5800 -> 4800
'pulumi-update-empty', 'aws-js-s3-folder', 5800 -> 4800
'pulumi-update-empty', 'aws-ts-s3-folder', 5800 -> 4800

@pulumi-bot pulumi-bot reopened this Nov 30, 2023
pulumi-bot (Contributor) commented

Cannot close issue:

  • does not have required labels: resolution/

Please fix these problems and try again.

@mjeffryes mjeffryes added the resolution/fixed label Nov 30, 2023