etcdserver: change the snapshot + compact into sync operation #18283
Conversation
Signed-off-by: Clement <[email protected]>
Hi @clement2026. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Here are the benchmark results:

Performance drops when the value size hits 2^13. I will perform CPU profiling for scenarios with large value sizes. The benchmarks were conducted on a cloud VM with 8 vCPUs and 16 GB of memory using the following script:

```shell
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
echo RATIO_LIST=$RATIO_LIST
echo REPEAT_COUNT=$REPEAT_COUNT
echo RUN_COUNT=$RUN_COUNT
date; cd ~/etcd-sync/tools/rw-heatmaps && ./rw-benchmark.sh && cd ~/etcd/tools/rw-heatmaps && sleep 30 && ./rw-benchmark.sh; date
```

According to the log, the task started at
Never mind, I read it incorrectly.
Actually, my observations show a performance improvement of up to 30%. The
Oh yeah, sorry, I read it incorrectly. I've just been reading benchmark results that show average request duration rather than throughput, so I assumed higher numbers were worse.
lol. I got it. It happens. 🤪 As the graph indicates a performance drop with larger value sizes, I am running the
Summary

Here are the results of the 4 benchmarks performed using the
Details

Hardware
Script

```shell
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh
```

Test 1
Test 2
Test 3
Test 4
I ran multiple CPU profiles with different

Since this patch tends to increase throughput, higher CPU usage wasn't surprising. These results didn't give me a

Anyway, I'm sharing these results here and would love to know what you think before I dig deeper.

CPU Time Usage
Script
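As an aside, CPU profiles like the ones summarized above can be captured in-process with Go's `runtime/pprof` (etcd serves the same data over HTTP when started with `--enable-pprof`). A minimal, self-contained sketch — the workload is illustrative, not the benchmark itself:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// captureCPUProfile records a CPU profile into memory while work runs,
// and returns the raw pprof protobuf bytes.
func captureCPUProfile(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	data, err := captureCPUProfile(func() {
		// Burn a little CPU so the profile contains samples.
		sum := 0
		for i := 0; i < 50_000_000; i++ {
			sum += i
		}
		_ = sum
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("captured pprof bytes:", len(data) > 0)
}
```

The resulting bytes can be written to a file and inspected with `go tool pprof`, whose `top` view yields per-function CPU time tables like the one above.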
cc @ahrtr |
Thanks @clement2026 for the test report. The throughput increase of up to 30% is a little weird. Theoretically, the performance should be very close. From an implementation perspective, the only possible reason for the throughput increase I can think of is the code snippet below, which won't be executed anymore in this PR. Could you double-check this to help us gain a better understanding? E.g., temporarily remove the code snippet on the main branch, and then compare with this PR again.

etcd/server/etcdserver/server.go Lines 2432 to 2440 in b433760
The CPU usage is very close, which looks fine.
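For readers without the code open, the referenced snippet guards snapshot triggering on the apply path. A simplified, hypothetical model of such a guard — names and threshold are illustrative, not etcd's actual code:

```go
package main

import "fmt"

// applier is a toy model of the state tracked on etcd's apply path.
type applier struct {
	appliedIndex  uint64 // index of the last applied raft entry
	snapshotIndex uint64 // index at which the last snapshot was taken
	snapshotCount uint64 // take a snapshot every snapshotCount entries
}

// shouldSnapshot reports whether enough entries have been applied since
// the last snapshot to trigger a new snapshot + compaction.
func (a *applier) shouldSnapshot() bool {
	return a.appliedIndex-a.snapshotIndex > a.snapshotCount
}

func main() {
	a := &applier{appliedIndex: 10001, snapshotIndex: 0, snapshotCount: 10000}
	fmt.Println(a.shouldSnapshot()) // 10001 - 0 > 10000, so a snapshot is due
}
```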
@ahrtr The 30% increase is really puzzling to me too. Can't wait to do the comparison and see what we find out.
Summary

Please disregard the earlier benchmark results; they were incorrect. Here are the reliable ones. Each branch was tested multiple times, with
It seems this PR/patch doesn't show significant performance changes. The benchmarks were conducted using the following script on a cloud VM with 8 vCPUs and 16 GB of RAM.

```shell
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
date; cd ~/etcd/tools/rw-heatmaps && ./rw-benchmark.sh; date;
```

Details

@ahrtr You were right about the strange 30% increase. It turned out to be bad data produced by my faulty script:
When running this script to benchmark 2 branches, the second one always shows a 30% drop in performance. I'm not sure if it's a machine issue, as I didn't see unusual I/O, swap, or CPU activity after each benchmark. Anyway, I managed to get solid benchmark results by rebooting the machine after each run. Below are the benchmark details; the machine was rebooted after each benchmark.

Test 1

Benchmark the main branch 3 times to ensure the results are reliable.
Test 2

Benchmark this PR/patch twice.

Test 3

Benchmark #18283 (comment) twice. Code is here.
Thanks @clement2026 for the hard & nice work!
It seems this PR/patch doesn't show significant performance changes.
This seems reasonable, and it aligns with our understanding.
Almost all the heatmap diagrams have very similar color distributions, so it's clear that the performance data is very close.
A separate but related topic: I still think it's worthwhile to implement the line charts (see #15060) as another visualisation method, which makes comparison clearer when there is a bigger performance difference. cc @ivanvc
Hey @ahrtr, I actually have a branch with this change, which I was working on some months ago. However, because of other tasks, I haven't been able to revisit it. I'll try to get back to this soon.
LGTM. Thanks, @clement2026.
Thanks @clement2026 for the thorough investigation. Exemplary work!
We should note this and keep it in mind when doing any future performance testing. It would be worth figuring out how we can protect against such cases.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahrtr, clement2026, ivanvc, serathius. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@serathius @ahrtr
Per the suggestion in #18235 (comment), I have changed the snapshot and compact operations into a synchronous process for simplification.

As a first step, I just removed the `s.GoAttach(func() {})` wrapper. I will add benchmark results once all tests pass.
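To illustrate the shape of the change, here is a toy model of the async-vs-sync distinction. `server` and `goAttach` below are simplified stand-ins for `etcdserver.EtcdServer` and its `GoAttach` helper (the real one also checks a stopping channel); they are not the actual etcd code:

```go
package main

import (
	"fmt"
	"sync"
)

// server is a toy stand-in for etcdserver.EtcdServer. goAttach mimics the
// real s.GoAttach: it runs f on a goroutine tracked by a WaitGroup so the
// server can wait for all attached goroutines during shutdown.
type server struct{ wg sync.WaitGroup }

func (s *server) goAttach(f func()) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		f()
	}()
}

func main() {
	s := &server{}

	// Before this PR (conceptually): snapshot + compact were scheduled
	// asynchronously, so the apply path did not wait for them to finish.
	s.goAttach(func() { fmt.Println("snapshot + compact (async)") })
	s.wg.Wait()

	// After this PR (conceptually): the same work runs inline, so it
	// completes before the apply path moves on to the next entries.
	fmt.Println("snapshot + compact (sync)")
}
```

Removing the `GoAttach` wrapper keeps the same work but makes its completion ordering deterministic relative to the apply loop, which is what simplifies the code.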