Reduce the agent's access pressure on the apiserver in large-scale scenarios #253

Merged
merged 2 commits into from
Mar 26, 2024

ypnuaa037
Contributor

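【Background】
In large-scale clusters, we found that the agent puts heavy pressure on the apiserver. Analysis shows two main sources:

  1. The agent continuously reads NodeLocalStorage from the apiserver to check the SPDK configuration. The interval is only 100ms, so the request rate on the apiserver is 10 * N per second (N is the number of nodes).
  2. The agent periodically reports the node's NodeLocalStorage status every 60s, whether or not the status has changed.
    Of the two, issue 1 is the main source of pressure.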
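
To make the 10 * N figure concrete, here is a minimal sketch of the arithmetic. The function name `requestsPerSecond` and the 1000-node cluster size are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// requestsPerSecond estimates the aggregate apiserver read rate when each of
// `nodes` agents polls once per `interval`. Hypothetical helper for
// illustration only.
func requestsPerSecond(nodes int, interval time.Duration) float64 {
	return float64(nodes) / interval.Seconds()
}

func main() {
	const nodes = 1000 // illustrative cluster size
	intervals := []time.Duration{
		100 * time.Millisecond, // the original polling interval
		time.Second,
		5 * time.Second,
	}
	for _, interval := range intervals {
		fmt.Printf("interval=%v -> %.0f requests/s\n",
			interval, requestsPerSecond(nodes, interval))
	}
}
```

Under this model the read load scales linearly with node count and inversely with the interval, which is why the polling interval is the first thing to adjust.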

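【Solution】

  1. Reduce the polling frequency; polling every 100ms is unnecessary.
  2. Report NodeLocalStorage only when it has changed. Most of the time nothing changes, so no report is needed.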
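
The second idea, reporting only on change, can be sketched as follows. This is a simplified illustration, not the code merged in this PR: `NodeStatus` and `Reporter` are hypothetical stand-ins, and the comparison uses the standard library's `reflect.DeepEqual`.

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeStatus is a hypothetical stand-in for the status portion of a
// NodeLocalStorage object; the real type has more fields.
type NodeStatus struct {
	Devices  []string
	Capacity map[string]int64
}

// Reporter remembers the last status it sent so unchanged states are skipped.
type Reporter struct {
	last    *NodeStatus
	updates int // number of writes that would have reached the apiserver
}

// Report sends the status only when it differs from the previously sent one
// and returns whether a write happened.
func (r *Reporter) Report(cur NodeStatus) bool {
	if r.last != nil && reflect.DeepEqual(*r.last, cur) {
		return false // nothing changed, skip the apiserver call
	}
	// A real agent would update the object on the apiserver here.
	r.updates++
	saved := cur // shallow copy; a real implementation should deep-copy
	r.last = &saved
	return true
}

func main() {
	r := &Reporter{}
	fmt.Println(r.Report(NodeStatus{Devices: []string{"/dev/sdb"}}))
	fmt.Println(r.Report(NodeStatus{Devices: []string{"/dev/sdb"}}))
	fmt.Println(r.Report(NodeStatus{Devices: []string{"/dev/sdb", "/dev/sdc"}}))
	fmt.Println("writes:", r.updates)
}
```

The periodic timer still fires, but a write is issued only when the comparison detects a difference, so a steady-state node generates no update traffic.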

@CLAassistant

CLAassistant commented Feb 20, 2024

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

Codecov Report

Attention: 23 lines in your changes are missing coverage. Please review.

Comparison is base (850073d) 32.35% compared to head (9275828) 31.94%.

Files Patch % Lines
pkg/agent/discovery/discovery.go 0.00% 22 Missing ⚠️
pkg/csi/nodeserver.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #253      +/-   ##
==========================================
- Coverage   32.35%   31.94%   -0.42%     
==========================================
  Files          41       41              
  Lines        6426     6443      +17     
==========================================
- Hits         2079     2058      -21     
- Misses       4058     4096      +38     
  Partials      289      289              
Flag Coverage Δ
unittests 31.94% <0.00%> (-0.42%) ⬇️


Collaborator

@peter-wangxu peter-wangxu left a comment

Please add a unit test case for this code.


- time.Sleep(time.Millisecond * 100)
+ time.Sleep(time.Second * 60)
Collaborator

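With 60s, would a newly added disk also take a minute to take effect?
If so, changing it to 1s would not hurt much, since reporting is now diff-based anyway.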

Contributor Author

@ypnuaa037 ypnuaa037 Feb 21, 2024

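With 60s, would a newly added disk also take a minute to take effect? If so, changing it to 1s would not hurt much, since reporting is now diff-based anyway.

For large clusters, 1s still feels a bit frequent, and for deployments that do not use SPDK the check is pointless. Could we extend it to 3-5 seconds, or make the interval configurable?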
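
One way the "configurable interval" option could look is sketched below. It is only an illustration under stated assumptions: `resolveInterval`, `runChecks`, the 5s default, and the 1s floor are all hypothetical and are not taken from the merged change.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	defaultCheckInterval = 5 * time.Second // assumed default, upper end of the 3-5s proposal
	minCheckInterval     = time.Second     // assumed floor to protect the apiserver
)

// resolveInterval parses a user-supplied duration string such as "3s".
// Invalid or non-positive values fall back to the default, and values below
// the floor are clamped.
func resolveInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultCheckInterval
	}
	if d < minCheckInterval {
		return minCheckInterval
	}
	return d
}

// runChecks calls check once per interval until it has run n times or ctx is
// cancelled, and returns the number of runs. The n bound exists only to keep
// the demo finite; a real agent would loop until cancellation.
func runChecks(ctx context.Context, interval time.Duration, n int, check func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	runs := 0
	for runs < n {
		select {
		case <-ctx.Done():
			return runs
		case <-ticker.C:
			check()
			runs++
		}
	}
	return runs
}

func main() {
	for _, raw := range []string{"3s", "100ms", "not-a-duration"} {
		fmt.Printf("%q -> %v\n", raw, resolveInterval(raw))
	}
	// Short interval here only so the demo finishes quickly.
	runs := runChecks(context.Background(), 10*time.Millisecond, 3, func() {
		fmt.Println("checking SPDK support (placeholder)")
	})
	fmt.Println("runs:", runs)
}
```

A `time.Ticker` with context cancellation replaces a bare `time.Sleep` loop so the interval can be injected from configuration and the loop can stop cleanly on shutdown.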

Contributor Author

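With 60s, would a newly added disk also take a minute to take effect? If so, changing it to 1s would not hurt much, since reporting is now diff-based anyway.

In a 1000-node cluster, we measured that installing the agent costs about 10 more CPU cores than not installing it. Even at 1 second, it would theoretically still cost about 1 core.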
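
The "1 core at 1 second" estimate follows if CPU cost is assumed to scale linearly with polling frequency. A small sketch of that extrapolation, where `estimateCores` is a hypothetical helper and the linear model is an assumption rather than a measurement:

```go
package main

import "fmt"

// estimateCores extrapolates a measured CPU cost to a new polling interval,
// assuming cost is proportional to polling frequency (an assumption, not a
// measured relationship).
func estimateCores(measuredCores, measuredIntervalSec, newIntervalSec float64) float64 {
	return measuredCores * measuredIntervalSec / newIntervalSec
}

func main() {
	// Measured in the comment above: about 10 cores at a 100ms interval.
	for _, interval := range []float64{1, 3, 5} {
		fmt.Printf("interval=%.0fs -> about %.2f cores\n",
			interval, estimateCores(10, 0.1, interval))
	}
}
```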

Collaborator

Sure, changing it to 3-5 seconds is OK.

Contributor Author

Sure, changing it to 3-5 seconds is OK.

I have added a unit test. Please review again, thanks.

The checkSPDKSupport frequency is too high, which puts a lot of pressure on apiserver in large cluster
@ypnuaa037 ypnuaa037 force-pushed the reduce-agent-access-to-apiserver branch from 9275828 to 6bc3046 Compare March 25, 2024 12:09
@peter-wangxu peter-wangxu merged commit 5ebf20f into alibaba:main Mar 26, 2024
7 of 8 checks passed