
Fix performance baseline breach #2799

Closed
justinvp opened this issue Aug 25, 2023 · 38 comments

justinvp (Member) commented Aug 25, 2023

[Screenshot: Screen Shot 2023-08-25 at 7 40 07 AM]

With pulumi/pulumi#13654, I bumped the aws-ts-s3-folder threshold from 4500 to 4800 after local investigation.

Now we're getting alerted that aws-js-s3-folder, aws-py-s3-folder, and aws-ts-s3-folder are over their thresholds for this test run. Needs investigation. Was it just a blip or a consistent issue?

@justinvp justinvp added the kind/bug, area/cli, impact/performance, and p1 labels Aug 25, 2023
@justinvp justinvp added this to the 0.93 milestone Aug 25, 2023
@justinvp justinvp self-assigned this Aug 25, 2023
@justinvp justinvp modified the milestones: 0.93, 0.94 Sep 6, 2023
justinvp (Member, Author) commented Sep 9, 2023

It looks like there were a few random blips, but once pulumi/examples#1471 was merged to upgrade the examples from v5 to v6 of pulumi-aws on 8/31 at 7:33 AM PT, we spiked. Those spikes are all 9/1.

[Screenshots: Screen Shot 2023-09-08 at 9 51 20 PM, 9 51 29 PM, 9 51 38 PM, 9 51 51 PM]

EvanBoyle (Contributor) commented

@justinvp I wonder if that is just a coincidence. We've been on an alpha version of AWS v6 in examples for much of August: https://github.com/pulumi/examples/commits/master/aws-js-s3-folder/package.json

EvanBoyle (Contributor) commented

v3.80.0 was also released on 8/31: https://github.com/pulumi/pulumi/releases/tag/v3.80.0

justinvp (Member, Author) commented Sep 9, 2023

We've been on an alpha version of AWS v6 in examples for much of August:

@EvanBoyle, the commit history is deceiving. Those commits were part of the long open pulumi/examples#1471 PR, which wasn't squashed when merged. The PR started in early August, but didn't get merged into master until 8/31.

justinvp (Member, Author) commented Sep 9, 2023

Trying out aws-js-s3-folder locally with the same version of the CLI (dev build of 3.81.1), I'm seeing a consistent difference (a couple seconds) between running an empty update with v5 vs v6.
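
A minimal way to time that comparison locally (assuming the stack is already deployed, so the update is a no-op; the flags shown are standard Pulumi CLI usage, not anything specific to this issue):

# time a no-op update against an already-deployed stack
time pulumi up --yes --skip-preview

# preview-only path, for comparison
time pulumi preview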

@justinvp justinvp removed the area/cli label Sep 11, 2023
@justinvp justinvp removed their assignment Sep 11, 2023
@justinvp justinvp added the needs-triage label Sep 11, 2023
@justinvp justinvp removed this from the 0.94 milestone Sep 11, 2023
@justinvp justinvp changed the title from "CLI performance metric breach" to "Investigate perf regression in pulumi-aws v6" Sep 11, 2023
@justinvp justinvp transferred this issue from pulumi/pulumi Sep 11, 2023
@mikhailshilkov mikhailshilkov added and removed the p1 label Sep 11, 2023
@mikhailshilkov mikhailshilkov removed the needs-triage label Sep 11, 2023
@mikhailshilkov mikhailshilkov added this to the 0.94 milestone Sep 11, 2023
mjeffryes (Member) commented

I've collected some trace data over several runs with a few different variations: pulumi-aws/issues/2799 latency investigation

A few observations:

  1. The latency increase is present in an empty update.
  2. There is a smaller latency increase for both preview-only and update-no-preview commands.
  3. The latency increase is tied to the plugin version used and is not affected by the SDK version.
  4. The latency delta between versions does not seem to be influenced by how many resources you are creating.
  5. Most confounding to me so far is that the traces don't show a significant increase in latency for Check or Diff (or any other provider method), but the parent span(s) for RegisterResource do show the latency.

mjeffryes (Member) commented Sep 15, 2023

I think I have a lead on where the slowness may be coming from:
There's a significant change in the spacing in /pulumirpc.ResourceMonitor/RegisterResource on either side of the /pulumirpc.ResourceProvider/CheckConfig trace for v6:

[Trace screenshots: v6, v5]

There's no subspan covering this time, but I suspect that it is where we are waiting on the provider plugin to start. Playing around with starting each plugin in my terminal, it does feel like there's a larger delay before printing the port number for v6. The difference is subtle (the traces indicate a difference of ~650ms), but since we restart the plugin multiple times during an update, it's probably enough to account for the ~3s increase in the overall update time.

(We should probably add a tracing span to cover provider start times; if my hunch is correct, that trace could have saved me several hours of experimenting with different profiling runs.)
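
As a rough illustration of that idea (this is not the actual pulumi/pulumi plugin host code; the helper, span name, and simulated delay below are made up):

package plugin

import (
    "context"
    "time"

    "github.com/opentracing/opentracing-go"
)

// startProviderWithSpan is a hypothetical wrapper that records how long we
// block waiting for a provider plugin to start and report its port.
func startProviderWithSpan(ctx context.Context, pkg, version string) (int, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "pulumi:provider-start")
    span.SetTag("package", pkg)
    span.SetTag("version", version)
    defer span.Finish()
    return launchPlugin(ctx, pkg, version)
}

// launchPlugin stands in for the real plugin launch, which blocks until the
// plugin prints the port it is serving on; here it just simulates the delay.
func launchPlugin(ctx context.Context, pkg, version string) (int, error) {
    time.Sleep(650 * time.Millisecond)
    return 54321, nil
}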

mjeffryes (Member) commented Sep 15, 2023

Further testing found that the v6 aws plugin does take ~500ms longer just to print the version string. The flame graph shows a significant amount of time spent in tfbridge.NewProviderMetadata and tfbridge.MustApplyAutoAliases:

[Flamegraph]

@t0yv0 confirmed that these are new to v6, as v6 is the first version of the aws provider that uses the tfbridge tools for bridging Plugin Framework (PF) based Terraform providers. (v5 contained a number of patches to downgrade resources that the upstream had moved to PF back to the Terraform SDKv2.)

I think that gives us the answer to our initial question of what is causing the slow-down, but there are a number of different items to follow up on to try to improve the situation:

A large portion of the extra time is spent unmarshalling the large (45MB) bridge-metadata.json file:

We could also:

  • Take a lazy approach to loading bridge metadata so that we don't eat the whole cost of loading the file on init. To avoid simply shifting the cost around, we might need a storage format that facilitates random access.
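
A minimal sketch of what that lazy approach could look like, assuming the metadata keeps its three top-level keys; the type and function names are illustrative, not the real tfbridge API:

package metadata

import (
    "encoding/json"
    "sync"
)

// lazyMetadata defers splitting the embedded blob into its top-level keys
// until the first time one of them is needed. Note this only moves the cost
// to first use; avoiding it entirely would need a random-access storage
// format, as the bullet above points out.
type lazyMetadata struct {
    raw  []byte // embedded bridge-metadata.json bytes
    once sync.Once
    keys map[string]json.RawMessage
    err  error
}

func newLazyMetadata(raw []byte) *lazyMetadata {
    return &lazyMetadata{raw: raw}
}

// Section returns the raw JSON for one top-level key ("auto-aliasing",
// "mux", or "renamed"), decoding the outer object only on first access.
func (m *lazyMetadata) Section(key string) (json.RawMessage, error) {
    m.once.Do(func() {
        m.err = json.Unmarshal(m.raw, &m.keys)
    })
    if m.err != nil {
        return nil, m.err
    }
    return m.keys[key], nil
}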

Regardless of the approach we take to reducing init time, we should also:

mikhailshilkov (Member) commented

It sounds like the source of impact is in our TFPF implementation, which is "expected" in some sense.

@mjeffryes How does a 500ms delay in plugin loading become a 3-second jump on all the charts in the issue description? Are we loading the plugin multiple times?

45 MB sounds like a lot! I think that's comparable to the metadata in Azure Native, which is an entirely different thing, but that provider has so many more resources and resource properties... Do we see ways to reduce that? For instance, I see things like RuleGroupRuleStatementAndStatementStatementNotStatementStatementRegexPatternSetReferenceStatementFieldToMatchHeader in it. Does it mean our WafV2 rewrites aren't working for metadata?

What is your recommendation to proceed? Is there a low-hanging fruit, or do we need to park it until the next iteration? Is it still a P1?

Could you please open an issue for the next steps?

t0yv0 (Member) commented Sep 18, 2023

Glad you could take a look here. WafV2 rewrites not being applied is a good catch! I think Matthew's list above is our low-hanging fruit (the first two items): get rid of renames and try a faster format, or even just a faster deserializer, like dropping in https://github.com/json-iterator/go, which we use in pulumi/pulumi.
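
A minimal sketch of that json-iterator drop-in (the surrounding names are illustrative, not the bridge's actual code):

package metadata

import (
    jsoniter "github.com/json-iterator/go"
)

// json is a drop-in replacement for encoding/json configured to match the
// standard library's behavior; existing call sites keep using json.Unmarshal
// unchanged, just with a faster decoder underneath.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// decodeBridgeMetadata splits the embedded blob into its top-level sections.
func decodeBridgeMetadata(raw []byte) (map[string]jsoniter.RawMessage, error) {
    var m map[string]jsoniter.RawMessage
    if err := json.Unmarshal(raw, &m); err != nil {
        return nil, err
    }
    return m, nil
}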

mikhailshilkov (Member) commented

Referencing pulumi/pulumi-azure-native#1689 here as some related prior art. @mjeffryes @t0yv0 maybe we can reuse some of those ideas.

@EvanBoyle EvanBoyle reopened this Sep 29, 2023
@EvanBoyle EvanBoyle changed the title from "Investigate perf regression in pulumi-aws v6" to "Fix performance baseline breach" Sep 29, 2023
EvanBoyle (Contributor) commented

We are not done here until the perf benchmarks are passing again. I've renamed the issue to reflect that and am adding back the P1 label.

@EvanBoyle EvanBoyle added the p1 label and removed the resolution/fixed label Sep 29, 2023
@mikhailshilkov mikhailshilkov modified the milestones: 0.94, 0.95 Oct 2, 2023
justinvp (Member, Author) commented Oct 17, 2023

We temporarily adjusted the alerting baselines until this issue has been resolved.

'pulumi-update-empty', 'aws-py-s3-folder', 4800 => 5800
'pulumi-update-empty', 'aws-js-s3-folder', 4500 => 5800
'pulumi-update-empty', 'aws-ts-s3-folder', 4800 => 5800

Part of addressing this issue is adjusting these back to lower than the previous values, given we're now reusing plugin processes across steps (pulumi/pulumi#13987).

@mikhailshilkov mikhailshilkov modified the milestones: 0.95, 0.96 Oct 26, 2023
blampe commented Nov 8, 2023

@mjeffryes @t0yv0 I spent some time hacking on a prototype for #2799 (comment) and wanted to share some findings.

First, to recap, I was curious to see if there's any benefit to moving some of our runtime deserialization down to the build/compile step. The schema and metadata we're embedding into the binary don't change, so (in theory) we should be able to build the result of our computation directly into the binary instead of re-computing it at runtime.

To set the baseline, benchmarking Provider() (with isTfGen true/false) on master gave me:

BenchmarkProviderGen-10    	       5	 228882600 ns/op	423606320 B/op	 2848954 allocs/op
BenchmarkProviderNoGen-10    	       5	 203570350 ns/op	347397752 B/op	 2308434 allocs/op

My first experiment was to slightly simplify, but not eliminate, the deserialization. In particular, I noticed we only have three top-level keys. Instead of embedding the entire bridge-metadata.json struct, we can generate a hack.go file that pulls out those fields while still keeping the raw JSON around.

// Generated into hack.go: the three top-level sections of bridge-metadata.json,
// kept as raw JSON so only the section that is needed gets decoded later.
var Data = tfbridge.Data{
    M: map[string]json.RawMessage{
        "auto-aliasing": json.RawMessage([]byte("{...}")), // taken from bridge-metadata.json
        "mux":           json.RawMessage([]byte("{...}")),
        "renamed":       json.RawMessage([]byte("{...}")),
    },
}

Initializing the provider with this eliminates the initial deserialization of the ~40MB blob, while subsequent operations still need to deserialize their smaller respective blobs. This is a relatively small/non-invasive change and cuts new provider time in ~half:

BenchmarkProviderGen-10    	       9	 113264319 ns/op	303111697 B/op	 1918905 allocs/op
BenchmarkProviderNoGen-10    	      10	 102601083 ns/op	277265940 B/op	 1615543 allocs/op

The next question was whether we could eliminate the deserialization altogether. The structs involved are large enough that we bump into golang/go#33437 but after some wrestling with the compiler it is possible to embed everything natively. (That is, deserialize all of the metadata, run MustApplyAutoAliases, and persist the result in code.)

The result is >200k lines of Go (~50MB) which takes several minutes to compile. Fortunately we can use build tags in a way that lets us continue to toggle between the embed approach and this one (otherwise it would be a nightmare to develop against!). This came in at about ~1/4 of the baseline:

BenchmarkProviderGen-10    	      20	  51311821 ns/op	156296561 B/op	  883827 allocs/op
BenchmarkProviderNoGen-10    	      20	  52889356 ns/op	156296712 B/op	  883832 allocs/op
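
A rough sketch of how such a build-tag toggle can be wired up (file names, package layout, and the Load function are hypothetical, not the actual branch):

// file: metadata_embed.go — default path, decodes the embedded JSON at runtime.

//go:build !hack

package metadata

import (
    _ "embed"
    "encoding/json"
)

//go:embed bridge-metadata.json
var rawMetadata []byte

// Load pays the full deserialization cost every time the provider starts.
func Load() (map[string]json.RawMessage, error) {
    var m map[string]json.RawMessage
    err := json.Unmarshal(rawMetadata, &m)
    return m, err
}

// file: metadata_hack.go — generated path, selected with `go build -tags hack`.

//go:build hack

package metadata

import "encoding/json"

// generatedMetadata stands in for the huge literal emitted by the codegen
// step; with the hack tag nothing is decoded at runtime.
var generatedMetadata = map[string]json.RawMessage{
    "auto-aliasing": json.RawMessage(`{}`),
    "mux":           json.RawMessage(`{}`),
    "renamed":       json.RawMessage(`{}`),
}

// Load just returns the pre-built data.
func Load() (map[string]json.RawMessage, error) {
    return generatedMetadata, nil
}

With that layout, a plain go build keeps the embedded-JSON path, while go build -tags hack compiles in the generated data instead.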

From here the bottleneck then becomes NewUpstreamProvider, which seems good? I'm not sure if we could do a similar trick with that one.

I have literally zero context on providers beyond the hacking I've done here, but I can push these branches if they would be useful!

mjeffryes (Member) commented

@blampe Those are some pretty nice results! It's a bit strange that you get roughly the same numbers for Gen vs. NoGen in all runs. I suspect it might be lying and always running in one mode or the other. But I think we'd definitely be interested in seeing your branch and trying to lock in those wins.

blampe commented Nov 13, 2023

It's a bit strange that you get roughly the same numbers for Gen vs. NoGen in all runs. I suspect it might be lying and always running in one mode or the other.

For sure, I could have also (very likely) borked something while I was hacking!

I've pushed branches here and here.

The "baseline codegen" commit is the first improvement (skipping the initial deserialization), and HEAD is the second (much crazier) codegen everything approach.

Run go run cmd/codegen/main.go to generate the hack.go file. To build the full codegen approach you'll need to add a build tag: go build -tags hack ./... -- that one took upwards of 10 minutes for me locally.

mjeffryes (Member) commented Nov 30, 2023

With pulumi/pulumi-terraform-bridge#1524 we are now roughly back in line with the v5 performance (see Metabase).

We still need to complete the performance testing in the providers to avoid shipping a regression like this in the future, but we can probably lower the alert thresholds and close this issue now.

t0yv0 (Member) commented Nov 30, 2023

By all means, let's do that so we can close.

I wasn't aware we'd improved here, so I was doing some work on this in the iteration.

The changes in 1544 are a little invasive and I'm still verifying if they're safe.

BenchmarkRuntimeProvider is down to 50-60ms.

The runtime metadata is much smaller.

du -sh provider/cmd/pulumi-resource-aws/runtime-bridge-metadata.json
 88K    provider/cmd/pulumi-resource-aws/runtime-bridge-metadata.json

AutoAliasing is not a major contributor in the profile anymore.

t0yv0 (Member) commented Nov 30, 2023

In terms of delivering more performance to our users around provider startup, going forward the super annoying cost seems to be in CheckConfig/Configure. Even with Luke's fix #3044 I'm getting something like 1s each for CheckConfig and Configure even in the basic cases. Setting this option as our perf-sensitive customers do:

aws:skipCredentialsValidation true

removes the cost of CheckConfig but not Configure.
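
For reference, that option is ordinary Pulumi stack configuration, which can be set with:

pulumi config set aws:skipCredentialsValidation true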

I'm not sure it's reasonable to spend 1s there; I suspect it's all about credentials and AWS client config. By comparison, aws s3 ls takes 600ms not only to get authenticated but also to retrieve the buckets. We should be able to do better there.

mjeffryes (Member) commented

Lowered the thresholds that we raised while investigating this issue:

'pulumi-update-empty', 'aws-py-s3-folder', 5800 -> 4800
'pulumi-update-empty', 'aws-js-s3-folder', 5800 -> 4800
'pulumi-update-empty', 'aws-ts-s3-folder', 5800 -> 4800

@pulumi-bot pulumi-bot reopened this Nov 30, 2023
pulumi-bot (Contributor) commented

Cannot close issue:

  • does not have required labels: resolution/

Please fix these problems and try again.

@mjeffryes mjeffryes added the resolution/fixed label Nov 30, 2023