
[Bug]: Agent does not scale well when dealing with very large numbers of disk devices. #437

Open
Ferroin opened this issue Jul 30, 2022 · 2 comments
Labels: bug, needs triage


@Ferroin
Member

Ferroin commented Jul 30, 2022

Bug description

When dealing with a large number of disk devices (many hundreds), the Netdata agent runs into significant performance issues.

On my home server system, currently with 1136 device mapper nodes, it takes almost 90 seconds for the dashboard to load over a local network connection, almost five minutes for data to start appearing, and even once data shows up the display is very choppy.

On the same system, turning off data collection for virtual disks in the proc plugin results in the dashboard loading almost instantly, data appearing almost immediately, and everything rendering perfectly smoothly.
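
For anyone wanting the same workaround, it is a small `netdata.conf` change. The section and option names below are written from memory, so treat them as assumptions and confirm them with `edit-config netdata.conf` on your install:

```
[plugin:proc:/proc/diskstats]
    # assumed option name; disables per-device charts for
    # virtual block devices (device mapper, md, loop, etc.)
    performance metrics for virtual disks = no
```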

I see similar performance issues when trying to load the node view in the cloud (though once it loads, data does start displaying much more quickly, and things are noticeably smoother), which suggests to me that the issue is in the agent and not the dashboard code.

Expected behavior

The dashboard loads smoothly, and displays data smoothly once it loads.

Steps to reproduce

  1. Create a very large number of disk devices (LVM's `--raidintegrity` functionality is useful for this, as a simple two-device RAID1 LV with integrity enabled creates a total of nine DM device nodes; see the sketch below the list).
  2. Try to load the local dashboard or cloud node view for the system.
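
A rough sketch of step 1, assuming an existing volume group `vg0` with at least two PVs and enough free space (names and sizes are purely illustrative):

```sh
# Each RAID1 LV with integrity enabled yields ~9 DM nodes: the
# top-level LV, two rimage and two rmeta sub-LVs, and an imeta
# and iorig integrity layer for each of the two legs.
for i in $(seq 1 128); do
  lvcreate --type raid1 -m 1 --raidintegrity y -L 64M -n stress$i vg0
done

ls /dev/mapper | wc -l   # sanity-check the resulting DM node count
```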

Installation method

kickstart.sh

System info

Linux home-one 5.18.1-ahferroin7+ #256 SMP Thu Jul 21 12:54:39 EDT 2022 x86_64 AMD Ryzen 9 3950X 16-Core Processor AuthenticAMD GNU/Linux
/etc/gentoo-release:Gentoo Base System release 2.8
/etc/lsb-release:DISTRIB_ID="Gentoo"
/etc/os-release:NAME=Gentoo
/etc/os-release:ID=gentoo
/etc/os-release:PRETTY_NAME="Gentoo Linux"
/etc/os-release:ANSI_COLOR="1;32"
/etc/os-release:VERSION_ID="2.8"

Netdata build info

Version: netdata v1.35.0-230-gd917f9831
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: manual-static
    Binary architecture: x86_64
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Additional info

I suspect this is a combination of two issues: one in the parsing code for /proc/diskstats (not particularly bad on its own, but enough to have a measurable impact), and one in the dbengine query code.
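
One low-effort way to separate the two suspected costs would be to profile the daemon while a dashboard load is in progress; a generic sketch, assuming `perf` is installed and the main daemon's process name is `netdata`:

```sh
# Sample where the agent spends CPU time while the dashboard loads.
# pidof -s returns a single PID; if several netdata processes are
# running, pick the main daemon's PID manually instead.
sudo perf top -p "$(pidof -s netdata)"
```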

I can provide more detailed information as needed.

@Ferroin added the bug, needs triage labels on Jul 30, 2022
@Ferroin
Member Author

Ferroin commented Jul 31, 2022

On further inspection, it appears this is an issue with the dashboard code specifically, albeit probably at least partially caused by the design of the REST API.

Manually querying /api/v1/charts on the system in question gets a complete response in less than one second. Trying to load that endpoint with the developer console open in Chrome actually crashes the developer console due to running out of memory.

This leads me to believe that the dashboard code itself is struggling to process all of this data (it's about 20 MB of nicely formatted JSON).
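
For reference, the server-side numbers above are easy to reproduce without a browser; a minimal check against a local agent, assuming the default port of 19999:

```sh
# Time the full /api/v1/charts response and report the payload size.
curl -s -o /dev/null \
  -w 'total: %{time_total}s, size: %{size_download} bytes\n' \
  http://localhost:19999/api/v1/charts
```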


Given this, I'm inclined to suggest we need to rethink the handling of this endpoint for the upcoming v2 REST API. I would suggest adding a variant of this endpoint that returns exactly the data the dashboard needs to populate the navigation menu, leaving things like lists of dimensions, which are only needed when actually rendering a chart, to a secondary endpoint that can be queried per chart.
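
If I'm remembering the v1 API correctly, a per-chart metadata endpoint already exists as `/api/v1/chart`, so this proposal is mostly about slimming down the bulk `/api/v1/charts` response rather than adding new query machinery (the chart name below is illustrative):

```sh
# Fetch metadata (including dimensions) for a single chart on demand.
curl -s 'http://localhost:19999/api/v1/chart?chart=system.cpu'
```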

@ilyam8 transferred this issue from netdata/netdata on Aug 4, 2022
@thiagoftsm

Hello @Ferroin,

I can confirm the issue is related to dbengine: when I was working on the integration between eBPF and cgroups, I also needed minutes to load the dashboard once I had more than 250 containers and 21,000 metrics.

Best regards!
