10-fold `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463

ctb · 2024-10-02T12:55:16Z

When benchmarking #430, I noticed a massive slowdown in manysearch. I took a few days to track it down using our conda-forge releases and https://github.com/dib-lab/2022-branchwater-benchmarking.

See results in a_vs_d.txt under benchmarks/ for the various branches of the benchmark repo - summarized below:

plugin version	time	memory
v0.8.6	814s	18.3 GB
v0.9.0	1332s	27.1 GB
v0.9.1	1361s	22.4 GB
v0.9.5	1534s	21.1 GB
v0.9.6	17523s	27.2 GB
unreleased w/#430 & #464	1428.0	18.6 GB
unreleased w/#464	1392.6	20.1 GB
unreleased w/#464 and `--ignore-abundance`	1388.6	20.5 GB

I think the max memory is variable because it depends on loading order and duration of the time the sketches are held in memory.

The bigger concern, of course, is the 10x increase in time that happened with release v0.9.6.

The culprit is #302, whichs add abundance-weighted information to manysearch and introduced a performance regression around sketch downsampling. #464 fixes the regression and also makes abundance calculations user-toggleable.

On a separate note, there has been a regular increase in execution time over the releases - from v0.8.6 to v0.9.0, and again between v0.9.1 and v0.9.5.

I think the first of these two slowdowns might have to do with the switch to use core manifest loading, and in particular the need to read .sig files when using pathlists - once to build a manifest, and then again when actually loading the sketch for search. This might be something that we can fix when dealing with #445.

Not sure what changed between v0.9.1 and v0.9.5, though. Maybe I should run some more benchmarks to nail down the exact release where the time increase occurred. Or maybe we should just profile the heck out of the plugin. Or both.

The text was updated successfully, but these errors were encountered:

ctb · 2024-10-05T19:12:40Z

I used git bisect to confirm that it's #302 that causes the big bump. Still working to figure out why 😅

ctb · 2024-10-05T22:45:39Z

Ah-hah! Figured it out! I'll write it up in more detail, but KmerMinHash.downsample_scaled(...) executes the downsampling & creation of a new sketch even when it is not needed (e.g. if the scaled is already ok), and it was being run on every sketch!

ctb · 2024-10-06T01:07:39Z

(Fixed in #464)

ctb · 2024-10-26T13:11:33Z

updated benchmarks as of v0.9.8 here: #479. Loading via .sig.zip and manifest CSVs seems to resolve the slowdown since v0.8.6 too. 🎉

ctb mentioned this issue Oct 2, 2024

MRG: fix performance regression in manysearch by removing unnecessary downsampling #464

Merged

This was referenced Oct 6, 2024

downsample_* functions in minhash.rs _always_ downsample, even when downsampling is not necessary sourmash-bio/sourmash#3343

Closed

MRG: add generic support for any type of sketch collection as query or database #430

Merged

ctb mentioned this issue Oct 13, 2024

MRG: avoid clones by using new Signature::try_into() -> KmerMinHash #471

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

10-fold `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463

10-fold `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463

ctb commented Oct 2, 2024 •

edited

Loading

ctb commented Oct 5, 2024

ctb commented Oct 5, 2024

ctb commented Oct 6, 2024

ctb commented Oct 26, 2024

10-fold manysearch performance slowdown between v0.9.5 and v0.9.6 #463

10-fold manysearch performance slowdown between v0.9.5 and v0.9.6 #463

Comments

ctb commented Oct 2, 2024 • edited Loading

ctb commented Oct 5, 2024

ctb commented Oct 5, 2024

ctb commented Oct 6, 2024

ctb commented Oct 26, 2024

10-fold `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463

10-fold `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463

ctb commented Oct 2, 2024 •

edited

Loading