fix: further optimize archive workflow listing. Fixes #13601 #13819

Open: 4 commits to be merged into base: main

MasonM (Contributor) commented Oct 26, 2024

Fixes #13601

Motivation

Listing archived workflows can be slow if you have a very large number of workflows (100,000+) or the average workflow size is high (100KB), even after the optimizations from #13566 and #13779. This PR makes some additional optimizations that speed up the list queries by ~90% on MySQL and ~50% on PostgreSQL.

Modifications

The bottleneck for these queries depended on whether you use MySQL or PostgreSQL, and each required a different fix. For PostgreSQL, the bottleneck was detoasting overhead, as explained in #13601 (comment). The fix was to use a common table expression to reduce the number of times the workflow column needs to be detoasted, as suggested by @kodieg in #13601 (comment). The new query looks like this:

WITH workflows AS (
  SELECT
    "name",
    "namespace",
    "uid",
    "phase",
    "startedat",
    "finishedat",
    coalesce(workflow->'metadata', '{}') as metadata,
    coalesce(workflow->'status', '{}') as status,
    workflow->'spec'->>'suspend' as suspend
  FROM "argo_archived_workflows"
  WHERE (("clustername" = $1 AND "namespace" = $2 AND "instanceid" = $3))
  ORDER BY "startedat" DESC
  LIMIT 100
) (
  SELECT
    "name",
    "namespace",
    "uid",
    "phase",
    "startedat",
    "finishedat",
    coalesce(metadata->>'labels', '{}') as labels,
    coalesce(metadata->>'annotations', '{}') as annotations,
    coalesce(status->>'progress', '') as progress,
    coalesce(metadata->>'creationTimestamp', '') as creationtimestamp,
    "suspend",
    coalesce(status->>'message', '') as message,
    coalesce(status->>'estimatedDuration', '0') as estimatedduration,
    coalesce(status->>'resourcesDuration', '{}') as resourcesduration
  FROM "workflows"
)

For MySQL, the bottleneck was the optimizer inexplicably refusing to use the argo_archived_workflows_i4 index and instead using the primary key, which is much more expensive. As explained by @Danny5487401 in #13563 (comment), two ways of solving that are using FORCE INDEX or adding a composite index on (clustername, startedat). Using FORCE INDEX is slightly hacky, and adding a new index is needlessly wasteful when we already have argo_archived_workflows_i4, so I opted to redefine that existing index to cover (clustername, startedat). The new query looks like this:

SELECT
  `name`,
  `namespace`,
  `uid`,
  `phase`,
  `startedat`,
  `finishedat`,
  coalesce(workflow->'$.metadata.labels', '{}') as labels,
  coalesce(workflow->'$.metadata.annotations', '{}') as annotations,
  coalesce(workflow->>'$.status.progress', '') as progress,
  coalesce(workflow->>'$.metadata.creationTimestamp', '') as creationtimestamp,
  workflow->>'$.spec.suspend',
  coalesce(workflow->>'$.status.message', '') as message,
  coalesce(workflow->>'$.status.estimatedDuration', '0') as estimatedduration,
  coalesce(workflow->'$.status.resourcesDuration', '{}') as resourcesduration
FROM `argo_archived_workflows`
WHERE ((`clustername` = ?  AND `namespace` = ? AND `instanceid` = ?))
ORDER BY `startedat` DESC
LIMIT 100
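To give some intuition for why an index on (clustername, startedat) suits this query, here is a toy Go model. It is only an illustration: the `row` type and the `buildIndex` and `latest` functions are invented for this sketch, a sorted slice stands in for a B-tree, and it says nothing about how MySQL's optimizer actually chooses a plan. It also ignores the `namespace` and `instanceid` filters, which are not part of the index.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a simplified stand-in for one archived workflow.
type row struct {
	clustername string
	startedat   int64 // Unix seconds, for simplicity
	uid         string
}

// buildIndex returns a copy of rows sorted by (clustername, startedat),
// the order a composite index on those two columns keeps its entries in.
func buildIndex(rows []row) []row {
	idx := append([]row(nil), rows...)
	sort.Slice(idx, func(i, j int) bool {
		if idx[i].clustername != idx[j].clustername {
			return idx[i].clustername < idx[j].clustername
		}
		return idx[i].startedat < idx[j].startedat
	})
	return idx
}

// latest returns up to limit rows for cluster, newest first. It finds the end
// of the cluster's range with a binary search and walks backwards, so no
// sort over the cluster's rows is needed at query time.
func latest(idx []row, cluster string, limit int) []row {
	// First position whose clustername sorts after cluster.
	end := sort.Search(len(idx), func(i int) bool { return idx[i].clustername > cluster })
	var out []row
	for i := end - 1; i >= 0 && idx[i].clustername == cluster && len(out) < limit; i-- {
		out = append(out, idx[i])
	}
	return out
}

func main() {
	idx := buildIndex([]row{
		{"default", 300, "c"},
		{"other", 250, "x"},
		{"default", 100, "a"},
		{"default", 200, "b"},
	})
	for _, r := range latest(idx, "default", 2) {
		fmt.Println(r.uid, r.startedat)
	}
}
```

In this model, `ORDER BY startedat DESC LIMIT 100` for one clustername amounts to reading the last 100 entries of a contiguous range, which is the access pattern the modified index is meant to enable.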

Verification

First, I used #13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template:

  • PostgreSQL - time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000
  • MySQL - time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000 -d 'mysql:password@tcp/argo'

Then, I ran make BenchmarkWorkflowArchive once on the main branch and once on this branch (with migration applied), and used benchstat to compare:

  • PostgreSQL
    $ benchstat postgres_before.txt postgres_after2.txt 
    goos: linux
    goarch: amd64
    pkg: github.com/argoproj/argo-workflows/v3/test/e2e
    cpu: 12th Gen Intel(R) Core(TM) i5-12400
                                                         │ postgres_before.txt │        postgres_after2.txt         │
                                                         │       sec/op        │   sec/op     vs base               │
    WorkflowArchive/ListWorkflows-12                              25.110m ± 1%   9.694m ± 3%  -61.39% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12           26.14m ± 3%   11.06m ± 3%  -57.70% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                              11.84m ± 1%   11.89m ± 2%        ~ (p=0.310 n=6)
    geomean                                                        19.81m        10.84m       -45.26%
    
                                                         │ postgres_before.txt │        postgres_after2.txt         │
                                                         │        B/op         │     B/op      vs base              │
    WorkflowArchive/ListWorkflows-12                              497.2Ki ± 0%   499.2Ki ± 0%  +0.40% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12          504.0Ki ± 0%   504.9Ki ± 0%  +0.18% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                             8.938Ki ± 1%   8.948Ki ± 2%       ~ (p=0.623 n=6)
    geomean                                                       130.8Ki        131.1Ki       +0.23%
    
                                                         │ postgres_before.txt │         postgres_after2.txt         │
                                                         │      allocs/op      │  allocs/op   vs base                │
    WorkflowArchive/ListWorkflows-12                               8.370k ± 0%   8.393k ± 0%  +0.27% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12           8.405k ± 0%   8.428k ± 0%  +0.26% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                               212.0 ± 0%    212.0 ± 0%       ~ (p=1.000 n=6) ¹
    geomean                                                        2.462k        2.466k       +0.18%
    ¹ all samples are equal
    
  • MySQL
    $ benchstat mysql_before.txt mysql_after.txt 
    goos: linux
    goarch: amd64
    pkg: github.com/argoproj/argo-workflows/v3/test/e2e
    cpu: 12th Gen Intel(R) Core(TM) i5-12400
                                                         │ mysql_before.txt │           mysql_after.txt           │
                                                         │      sec/op      │    sec/op     vs base               │
    WorkflowArchive/ListWorkflows-12                           43.510m ± 4%   1.650m ± 20%  -96.21% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12       69.646m ± 2%   4.513m ± 24%  -93.52% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                           27.37m ± 4%   28.98m ± 24%        ~ (p=0.394 n=6)
    geomean                                                     43.61m        5.998m        -86.25%
    
                                                         │ mysql_before.txt │          mysql_after.txt           │
                                                         │       B/op       │     B/op      vs base              │
    WorkflowArchive/ListWorkflows-12                           477.3Ki ± 0%   457.6Ki ± 0%  -4.13% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12       488.1Ki ± 0%   461.0Ki ± 0%  -5.54% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                          7.937Ki ± 1%   7.944Ki ± 0%       ~ (p=0.461 n=6)
    geomean                                                    122.7Ki        118.8Ki       -3.22%
    
                                                         │ mysql_before.txt │           mysql_after.txt           │
                                                         │    allocs/op     │  allocs/op   vs base                │
    WorkflowArchive/ListWorkflows-12                            8.224k ± 0%   7.705k ± 0%  -6.31% (p=0.002 n=6)
    WorkflowArchive/ListWorkflows_with_label_selector-12        8.277k ± 0%   7.732k ± 0%  -6.58% (p=0.002 n=6)
    WorkflowArchive/CountWorkflows-12                            205.0 ± 0%    205.0 ± 0%       ~ (p=1.000 n=6) ¹
    geomean                                                     2.408k        2.303k       -4.35%
    ¹ all samples are equal
    

@@ -39,6 +39,7 @@ E2E_PARALLEL ?= 20
 E2E_SUITE_TIMEOUT ?= 15m
 GOTEST ?= go test -v -p 20
 ALL_BUILD_TAGS ?= api,cli,cron,executor,examples,corefunctional,functional,plugins
+BENCHMARK_COUNT ?= 6
Contributor Author

This is so benchstat gives confidence intervals. Without it, benchstat shows the message "¹ need >= 6 samples for confidence interval at level 0.95".

@MasonM MasonM marked this pull request as ready for review October 26, 2024 02:06
ansiSQLChange(`drop index argo_archived_workflows_i4 on argo_archived_workflows`),
ansiSQLChange(`drop index argo_archived_workflows_i4`),
),
ansiSQLChange(`create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)`),
Member

@agilgur5 agilgur5 Oct 26, 2024

How quickly does this run? Quickly enough to include in a patch version, i.e. to cherry-pick into 3.5?

Contributor Author

Very quickly. Tested after populating my DB with 100,000 rows, all associated with the same clustername:

MySQL:

$ time go run ./hack/db migrate -d 'mysql:password@tcp/argo'
INFO[0000] Migrating database schema                     clusterName=default dbType=mysql
INFO[0000] applying database change                      change="drop index argo_archived_workflows_i4 on argo_archived_workflows" changeSchemaVersion=61
INFO[0000] applying database change                      change="create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)" changeSchemaVersion=62
2024/10/26 02:24:13     Session ID:     00001
        Query:          create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)
        Stack:          
                fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
                fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
                fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1208
                fmt.Append@/usr/local/go/src/fmt/print.go:289
                log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
                log.(*Logger).output@/usr/local/go/src/log/log.go:238
                log.(*Logger).Print@/usr/local/go/src/log/log.go:260
                github.com/argoproj/argo-workflows/v3/persist/sqldb.ansiSQLChange.apply@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/ansi_sql_change.go:11
                github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:301
                github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:290
                github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.Exec@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:279
                main.NewMigrateCommand.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:50
                github.com/spf13/cobra.(*Command).execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:985
                github.com/spf13/cobra.(*Command).ExecuteC@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117
                github.com/spf13/cobra.(*Command).Execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
                main.main@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:39
                runtime.main@/usr/local/go/src/runtime/proc.go:272
                runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1700
        Rows affected:  0
        Last insert ID: 0
        Error:          upper: slow query
        Time taken:     0.95033s
        Context:        context.Background


real    0m1.758s
user    0m1.074s
sys     0m0.294s

PostgreSQL:

 $ time go run ./hack/db migrate
INFO[0000] Migrating database schema                     clusterName=default dbType=postgres
INFO[0000] applying database change                      change="drop index argo_archived_workflows_i4" changeSchemaVersion=61
INFO[0000] applying database change                      change="create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)" changeSchemaVersion=62

real    0m0.868s
user    0m1.117s
sys     0m0.271s
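The two logs differ only in the drop statement, because MySQL's DROP INDEX needs the table name while PostgreSQL's does not. A minimal sketch of that dialect split in Go might look like the following. `indexMigration` and its `dbType` strings are hypothetical and written just for this illustration; the actual migration in this PR uses the `ansiSQLChange` entries shown in the diff.

```go
package main

import "fmt"

// indexMigration is a hypothetical helper (not the project's migration code).
// It returns the statements that redefine argo_archived_workflows_i4 for the
// given database type.
func indexMigration(dbType string) []string {
	drop := `drop index argo_archived_workflows_i4`
	if dbType == "mysql" {
		// MySQL requires the table name in DROP INDEX.
		drop += ` on argo_archived_workflows`
	}
	return []string{
		drop,
		`create index argo_archived_workflows_i4 on argo_archived_workflows (clustername, startedat)`,
	}
}

func main() {
	for _, dbType := range []string{"mysql", "postgres"} {
		fmt.Println(dbType + ":")
		for _, stmt := range indexMigration(dbType) {
			fmt.Println("  " + stmt)
		}
	}
}
```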

Member

Dope, great work again!

Member

@agilgur5 agilgur5 left a comment

Huge thanks for deep-diving & analyzing this, describing it, and testing it!

Successfully merging this pull request may close these issues.

3.5: Further optimize Archive List API call / DB query