
etcdserver: change the snapshot + compact into sync operation #18283

Conversation

@clement2026
Contributor

@serathius @ahrtr
Per the suggestion in #18235 (comment), I have changed the snapshot and compact operations to run synchronously, for simplification.

As a first step, I just removed s.GoAttach(func() {}).

I will add benchmark results once all tests pass.
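
Roughly, the shape of the change looks like this. It's a simplified, self-contained sketch; the type, function names, and bodies below are placeholders, not the actual etcdserver code.

// Simplified sketch of the change: instead of scheduling the snapshot+compact
// work on a tracked background goroutine via a GoAttach-style helper, call it
// directly on the apply path. All names and bodies here are placeholders.
package main

import (
    "fmt"
    "sync"
)

type server struct {
    wg sync.WaitGroup
}

// goAttach mimics EtcdServer.GoAttach: run f on a tracked goroutine.
func (s *server) goAttach(f func()) {
    s.wg.Add(1)
    go func() {
        defer s.wg.Done()
        f()
    }()
}

// snapshotAndCompact stands in for the real snapshot + raft log compaction.
func (s *server) snapshotAndCompact(appliedIndex uint64) {
    fmt.Println("snapshot + compact at index", appliedIndex)
}

// Before (main): the work is scheduled asynchronously.
func (s *server) triggerSnapshotAsync(appliedIndex uint64) {
    s.goAttach(func() { s.snapshotAndCompact(appliedIndex) })
}

// After (this PR): the work runs synchronously on the calling goroutine.
func (s *server) triggerSnapshotSync(appliedIndex uint64) {
    s.snapshotAndCompact(appliedIndex)
}

func main() {
    s := &server{}
    s.triggerSnapshotAsync(100)
    s.triggerSnapshotSync(200)
    s.wg.Wait()
}

The only behavioural difference in the sketch is that the snapshot+compact work now blocks the caller instead of running on a tracked background goroutine.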

@k8s-ci-robot

Hi @clement2026. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@clement2026 marked this pull request as ready for review July 4, 2024 17:54
@henrybear327
Contributor

/ok-to-test

@clement2026
Contributor Author

clement2026 commented Jul 5, 2024

Here are the benchmark results:

  • Reading performance change: -1.67% ~ 27.31%
  • Writing performance change: -1.12% ~ 28.64%

Performance drops when the value size reaches 2^13 bytes (8 KB). I will run CPU profiling for scenarios with large value sizes.

compare_read
main.csv
patch.csv

The benchmarks were conducted on a cloud VM with 8 vCPUs and 16 GB of memory using the following script:

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
echo RATIO_LIST=$RATIO_LIST
echo REPEAT_COUNT=$REPEAT_COUNT
echo RUN_COUNT=$RUN_COUNT
date; cd ~/etcd-sync/tools/rw-heatmaps && ./rw-benchmark.sh && cd ~/etcd/tools/rw-heatmaps &&  sleep 30 &&./rw-benchmark.sh; date

According to the log, the task started at Thu Jul 4, 2024, 07:29:48 PM UTC and finished at Thu Jul 4, 2024, 11:43:24 PM UTC, taking a total of 4 hours and 13 minutes.
nohup.out.txt

@serathius
Member

serathius commented Jul 5, 2024

That doesn't look very good; making the snapshot sync creates up to a 30% regression. cc @ahrtr

Never mind, I read it incorrectly.

@purr100

purr100 commented Jul 5, 2024

That doesn't look very good; making the snapshot sync creates up to a 30% regression. cc @ahrtr

Actually, my observations show a performance improvement of up to 30%. The patch.csv shows higher throughput. @serathius Could you please recheck?😬

@serathius
Member

serathius commented Jul 5, 2024

Actually, my observations show a performance improvement of up to 30%. The patch.csv shows higher throughput. @serathius Could you please recheck?😬

Oh yeah, sorry, I read it incorrectly. I've been looking at benchmark results that show average request duration rather than throughput, so a higher number made me think it was worse.

@purr100

purr100 commented Jul 5, 2024

Oh yeah, sorry, I read it incorrectly. I've been looking at benchmark results that show average request duration rather than throughput, so a higher number made me think it was worse.

lol. I got it. It happens.🤪

As the graph indicates a performance drop with larger value sizes, I am running the rw-benchmark.sh script with larger value sizes to verify this issue.

@clement2026
Contributor Author

Summary

Here are the results of the 4 benchmarks performed using the rw-benchmark.sh script.

Test     Value Size Range   Read Performance Change   Write Performance Change
Test 1   256 B ~ 16 KB      -1.67% ~ 27.31%           -1.12% ~ 28.64%
Test 2   256 B ~ 16 KB      -0.67% ~ 30.40%           -1.32% ~ 30.71%
Test 3   256 B ~ 32 KB       3.68% ~ 33.13%            3.00% ~ 34.37%
Test 4   8 KB ~ 32 KB        0.11% ~ 20.38%            0.97% ~ 21.13%

Details

Hardware

  • Test 1 was conducted on a cloud VM with 8 vCPUs and 16 GB RAM.
  • The remaining 3 tests were conducted on cloud VMs with 8 vCPUs and 32 GB RAM.

Script
All 4 tests use this script but differ in their VALUE_SIZE_POWER_RANGE variable.

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh

Test 1

export VALUE_SIZE_POWER_RANGE="8 14"

compare_read

main.csv patch.csv

Test 2

export VALUE_SIZE_POWER_RANGE="8 14"

compare_read

main.csv patch.csv

Test 3

export VALUE_SIZE_POWER_RANGE="8 15"

compare_read

main.csv patch.csv

Test 4

export VALUE_SIZE_POWER_RANGE="13 15"

compare_read

main.csv patch.csv

@clement2026
Contributor Author

I ran multiple CPU profiles with different value sizes and connection counts. The results show that MVCC operations like mvcc.(*keyIndex).get, mvcc.(*keyIndex).isEmpty, and mvcc.(*keyIndex).findGeneration account for a significant share of total CPU time. Other functions worth noting are runtime.memmove, syscall.Syscall6, and cmpbody.

Since this patch tends to increase throughput, higher CPU usage wasn't surprising. These results didn't give me a clear conclusion. To better understand the issue, I should have collected and compared CPU profile data when the patch showed lower throughput; unfortunately, I didn't record the throughput during the CPU profiling runs.

Anyway, I’m sharing these results here and would love to know what you think before I dig deeper.

CPU Time Usage

Connection Count   Value Size   Main      Patch     Change in CPU Time Usage   Files
32                 16 KB        322.51s   331.32s    2.73%                     main.pb.gz patch.pb.gz
32                 32 KB        467.14s   460.28s   -1.47%                     main.pb.gz patch.pb.gz
32                 64 KB        596.94s   588.16s   -1.47%                     main.pb.gz patch.pb.gz
1024               16 KB        319.78s   332.02s    3.83%                     main.pb.gz patch.pb.gz
1024               32 KB        424.93s   435.31s    2.44%                     main.pb.gz patch.pb.gz
1024               64 KB        544.16s   547.28s    0.57%                     main.pb.gz patch.pb.gz

Script
run.sh.zip
All these tests use this script but with different VALUE_SIZE and CONN_CLI_COUNT values.
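
For reference, here is one way such profiles can be captured: a sketch that assumes the etcd server was started with --enable-pprof (so the standard net/http/pprof handlers are exposed on the client URL, assumed here to be 127.0.0.1:2379) and uses an arbitrary output file name. The saved file can then be inspected with go tool pprof.

// Sketch: fetch a 30-second CPU profile from a local etcd started with
// --enable-pprof, and save it for later analysis. The URL and file name
// are assumptions; adjust them for your setup.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    // /debug/pprof/profile is the standard net/http/pprof CPU profile endpoint;
    // the seconds parameter controls how long the profile runs.
    resp, err := http.Get("http://127.0.0.1:2379/debug/pprof/profile?seconds=30")
    if err != nil {
        fmt.Fprintln(os.Stderr, "fetch profile:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()

    out, err := os.Create("cpu.pb.gz")
    if err != nil {
        fmt.Fprintln(os.Stderr, "create output file:", err)
        os.Exit(1)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        fmt.Fprintln(os.Stderr, "save profile:", err)
        os.Exit(1)
    }
    fmt.Println("wrote cpu.pb.gz")
}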

@serathius
Member

cc @ahrtr

@ahrtr
Member

ahrtr commented Jul 15, 2024

Thanks @clement2026 for the test report. The throughput increase of up to 30% is a little weird; theoretically, the performance should be very close.

From an implementation perspective, the only possible reason for the throughput increase I can think of is the code snippet below, which won't be executed anymore in this PR. Could you double-check this to help us get a better understanding? e.g. temporarily remove the code snippet on the main branch, and then compare with this PR again.

s.wgMu.RLock() // this blocks with ongoing close(s.stopping)
defer s.wgMu.RUnlock()
select {
case <-s.stopping:
    lg := s.Logger()
    lg.Warn("server has stopped; skipping GoAttach")
    return
default:
}

The CPU usage is very close, which looks fine.

@clement2026
Contributor Author

From an implementation perspective, the only possible reason for the throughput increase I can think of is the code snippet below, which won't be executed anymore in this PR. Could you double-check this to help us get a better understanding? e.g. temporarily remove the code snippet on the main branch, and then compare with this PR again.

@ahrtr The 30% increase is really puzzling to me too. Can't wait to do the comparison and see what we find out.

@clement2026
Contributor Author

Summary

Please disregard the earlier benchmark results. They were incorrect. Here are the reliable ones. Each branch was tested multiple times, with main 01 as the baseline.

Branch               Read Performance Change   Write Performance Change
main 01 (baseline)   -                         -
main 02              [-5.38%, 6.66%]           [-5.09%, 6.52%]
main 03              [-4.45%, 7.12%]           [-3.82%, 7.20%]
patch 01             [-3.49%, 5.95%]           [-4.78%, 6.40%]
patch 02             [-4.68%, 4.62%]           [-5.07%, 6.42%]
remove-rwlock 01     [-3.41%, 4.79%]           [-3.87%, 5.34%]   (based on #18283 (comment))
remove-rwlock 02     [-5.34%, 4.81%]           [-5.74%, 6.65%]

It seems this PR/patch doesn't show significant performance changes.

The benchmarks were conducted using the following script on a cloud VM with 8 vCPUs and 16 GB RAM.

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
date; cd ~/etcd/tools/rw-heatmaps && ./rw-benchmark.sh; date;

Details

@ahrtr You were right about the strange 30% increase. The 30% turns out to be wrong data from my faulty script:

date; cd ~/etcd-sync/tools/rw-heatmaps && ./rw-benchmark.sh && cd ~/etcd/tools/rw-heatmaps &&  sleep 30 &&./rw-benchmark.sh; date

When running this script to benchmark two branches, the second one always shows a roughly 30% drop in performance. I'm not sure if it's a machine issue, as I didn't see unusual I/O, swap, or CPU activity after each benchmark.

Anyway, I managed to get solid benchmark results by rebooting the machine after each run. The benchmark details are below.

Test 1

Benchmarked the main branch 3 times to ensure the results are reliable.

main-01-vs-main-02
main-01-vs-main-03
main-01.csv main-02.csv main-03.csv

Test 2

Benchmarked this PR/patch twice.

main-01-vs-patch
main-01-vs-patch-02
patch.csv patch-02.csv

Test 3

Benchmarked #18283 (comment) twice. Code is here.

main-01-vs-remove-rwlock
main-01-vs-remove-rwlock-02
remove-rwlock.csv remove-rwlock-02.csv

@ahrtr
Member

ahrtr left a comment

Thanks @clement2026 for the hard & nice work!

It seems this PR/patch doesn't show significant performance changes.

This seems reasonable, and it aligns with our understanding.

Almost all the heatmap diagrams have very similar color distributions, so it's clear that they reflect very similar performance data.

A separate but related topic... I still think it's worthwhile to implement the line charts (see #15060) as another visualisation method, which is clearer for comparison when there is a bigger performance difference. cc @ivanvc

@ahrtr
Member

ahrtr commented Jul 23, 2024

cc @ivanvc @jmhbnz @serathius

@ivanvc
Member

ivanvc commented Jul 23, 2024

A separate but related topic... I still think it's worthwhile to implement the line charts (see #15060) as another visualisation method, which is clearer for comparison when there is a bigger performance difference. cc @ivanvc

Hey @ahrtr, I actually have a branch with this change, which I was working on some months ago. However, because of other tasks, I haven't been able to revisit it. I'll try to get back to this soon.

@ivanvc
Member

ivanvc left a comment

LGTM. Thanks, @clement2026.

@serathius
Member

Thanks @clement2026 for the thorough investigation. Exemplary work!

When running this script to benchmark two branches, the second one always shows a roughly 30% drop in performance. I'm not sure if it's a machine issue, as I didn't see unusual I/O, swap, or CPU activity after each benchmark.

We should note this and keep it in mind when doing any future performance testing. It would be worth figuring out how we can protect against such cases.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, clement2026, ivanvc, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here


@serathius merged commit 9a6c9ae into etcd-io:main Jul 24, 2024
51 checks passed
@clement2026 deleted the change-snapshot-and-compact-into-sync-operation branch July 27, 2024 05:58