[action] [PR:345] Fix snmp agent not-responding issue when high CPU utilization #347
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix sonic-net/sonic-buildimage#21314
The SNMP agent and MIB updaters are basic on Asyncio/Coroutine,
the mib updaters share the same Asyncio event loop with the SNMP agent client.
Hence during the updaters executing, the agent client can't receive/respond to new requests.
When the CPU utilization is high (In some stress test we make CPU 100% utilization), the updates are slow, and this causes the snmpd request to be timeout because the agent got suspended during updating.
- What I did
Decrease the MIB update frequency when the update execution is slow.
pros:
cons:
1.a ciscoSwitchQosMIB.QueueStatUpdater, generally finishs in 0.5-2s, expected to be delayed to 5-20s
1.b PhysicalTableMIBUpdater, generally finishs in 0.5-1.5s, expected to be delayed to 5-1.5s
1.c ciscoPfcExtMIB.PfcPrioUpdater, generally finishs in 0.5-3s, expected to be delayed to 5-30s
- How I did it
In get_next_update_interval, we compute the interval based on current execution time.
Roughly, we make the 'update interval'/'update execution time' >= UPDATE_FREQUENCY_RATE(10)
More specifically, if the execution time is 2.000001s, we sleep 21s before next update round.
And the max interval won't be longer than MAX_UPDATE_INTERVAL(60s).
- How to verify it
Test on Cisco chassis, test_snmp_cpu.py which triggers 100% CPU utilization test whether snmp requests work well.
- Description for the changelog