This repository has been archived by the owner on Apr 19, 2024. It is now read-only.
When dealing with a large number of disk devices (many hundreds), the Netdata agent runs into significant performance issues.
On my home server system, currently with 1136 device mapper nodes, it takes almost 90 seconds for the dashboard to load over a local network connection, almost five minutes for data to start appearing, and even once data shows up the display is very choppy.
On the same system, turning off data collection for virtual disks in the proc plugin makes the dashboard load almost instantly, data starts showing up almost instantly, and everything is perfectly smooth.
I see similar performance issues when trying to load the node view in the cloud (though once it loads, data starts displaying much more quickly, and things are noticeably smoother), which suggests to me that the issue is in the agent and not the dashboard code.
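For reference, this is the proc plugin setting I toggled to disable virtual-disk collection (a netdata.conf fragment; the option accepts auto/yes/no, and the exact option name may differ slightly between versions):

```ini
# netdata.conf
[plugin:proc:/proc/diskstats]
    performance metrics for virtual disks = no
```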
Expected behavior
The dashboard loads smoothly, and displays data smoothly once it loads.
Steps to reproduce
Create a very large number of disk devices (LVM's raidintegrity functionality is useful for this, as a simple two-device RAID1 LV with raid integrity enabled creates a total of nine DM device nodes).
Try to load the local dashboard or cloud node view for the system.
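As a sketch of the reproduction commands (assuming a volume group named vg0 with at least two PVs, run as root; the VG name and sizes are placeholders):

```shell
# Create a two-device RAID1 LV with dm-integrity enabled. With integrity on,
# each mirror leg gets extra _rimage/_rmeta/_imeta sub-LVs, so one LV
# produces nine DM nodes in total. Repeat to multiply the node count.
lvcreate --type raid1 -m 1 --raidintegrity y -L 1G -n test0 vg0

# Count the resulting device-mapper nodes:
dmsetup ls | wc -l
```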
Version: netdata v1.35.0-230-gd917f9831
Configure options: '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: manual-static
Binary architecture: x86_64
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: NO
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: YES
Libraries:
protobuf: YES (system)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
Additional info
I suspect this is a combination of two issues: one in the parsing code for /proc/diskstats (not particularly bad on its own, but enough to have a measurable impact), and one in the dbengine query code.
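To illustrate the per-line work involved, here is a rough Python sketch of /proc/diskstats parsing (this follows the kernel's documented field layout; Netdata's actual parser is C code and differs in detail). With over a thousand DM nodes, this runs for every device on every collection tick:

```python
# Minimal /proc/diskstats parser sketch: major, minor, device name,
# then at least 11 I/O counters per line (newer kernels add more).
def parse_diskstats(text):
    """Return {device_name: counter_tuple} for each line of /proc/diskstats."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:  # major + minor + name + 11 counters minimum
            continue
        name = fields[2]
        devices[name] = tuple(int(f) for f in fields[3:])
    return devices

# Two fabricated sample lines in the kernel's format:
sample = (
    " 253       0 dm-0 120 0 5000 30 60 0 2000 15 0 40 45\n"
    " 253       1 dm-1 10 0 400 2 5 0 160 1 0 3 3\n"
)
stats = parse_diskstats(sample)
```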
I can provide more detailed information as needed.
On further inspection, it appears this is an issue with the dashboard code specifically, albeit probably at least partially caused by the design of the REST API.
Manually querying /api/v1/charts on the system in question gets a complete response in less than one second. Trying to load that endpoint with the developer console open in Chrome actually crashes the developer console due to running out of memory.
This leads me to believe that the dashboard code itself is having trouble processing all this data (it's about 20 MB of nicely formatted JSON).
Given this, I'm inclined to suggest we rethink the handling of this endpoint for the upcoming v2 REST API. I would suggest adding a variant that returns only the data the dashboard needs to populate the navigation menu, leaving things like dimension lists, which are only needed when actually rendering a chart, to a secondary endpoint that can be queried per chart.
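A hypothetical sketch of what that split could look like, as a transformation over the existing v1 response shape (field names here follow the current /api/v1/charts output; the slimmed shape itself is just my proposal, not an existing endpoint):

```python
# Reduce a full /api/v1/charts-style response to only the fields the
# dashboard menu needs, dropping per-chart dimension lists entirely.
def menu_view(charts_response):
    """Return {chart_id: navigation_fields} without dimension data."""
    slim = {}
    for chart_id, chart in charts_response.get("charts", {}).items():
        slim[chart_id] = {
            "title": chart.get("title"),
            "family": chart.get("family"),
            "context": chart.get("context"),
        }
    return slim

# Fabricated one-chart example of the v1 shape:
full = {
    "charts": {
        "disk.dm-0": {
            "title": "Disk I/O (dm-0)",
            "family": "dm-0",
            "context": "disk.io",
            "dimensions": {"reads": {}, "writes": {}},
        }
    }
}
slim = menu_view(full)
```

With a thousand-plus disks, dropping the dimension lists is where most of the payload reduction would come from.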
I can confirm the dbengine-related part of the issue: while working on the integration between eBPF and cgroups, it also took minutes for the dashboard to load when I had more than 250 containers and 21,000 metrics.
Installation method
kickstart.sh