
[Bug]: Agent does not scale well when dealing with very large numbers of disk devices. #437

Open
Ferroin opened this issue Jul 30, 2022 · 2 comments
Labels: bug, needs triage


@Ferroin
Member

Ferroin commented Jul 30, 2022

Bug description

When dealing with a large number of disk devices (many hundreds), the Netdata agent runs into significant performance issues.

On my home server system, currently with 1136 device mapper nodes, it takes almost 90 seconds for the dashboard to load over a local network connection, almost five minutes for data to start appearing, and even once data shows up the display is very choppy.

On the same system, turning off data collection for virtual disks in the proc plugin results in the dashboard loading almost instantly, data appearing almost immediately, and everything rendering perfectly smoothly.
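
For anyone wanting the same workaround, it is a small `netdata.conf` change. The section and option names below are written from memory, so treat them as assumptions and confirm them with `edit-config netdata.conf` on your install:

```
[plugin:proc:/proc/diskstats]
    # assumed option name; disables per-device charts for
    # virtual block devices (device mapper, md, loop, etc.)
    performance metrics for virtual disks = no
```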

I see similar performance issues when trying to load the node view in the cloud (though once it loads, data does start displaying much more quickly, and things are noticeably smoother), which suggests to me that the issue is in the agent and not the dashboard code.

Expected behavior

The dashboard loads smoothly, and displays data smoothly once it loads.

Steps to reproduce

  1. Create a very large number of disk devices (LVM's `--raidintegrity` functionality is useful for this, as a simple two-device RAID1 LV with integrity enabled creates a total of nine DM device nodes; see the sketch below the list).
  2. Try to load the local dashboard or cloud node view for the system.
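
A rough sketch of step 1, assuming an existing volume group `vg0` with at least two PVs and enough free space (names and sizes are purely illustrative):

```sh
# Each RAID1 LV with integrity enabled yields ~9 DM nodes: the
# top-level LV, two rimage and two rmeta sub-LVs, and an imeta
# and iorig integrity layer for each of the two legs.
for i in $(seq 1 128); do
  lvcreate --type raid1 -m 1 --raidintegrity y -L 64M -n stress$i vg0
done

ls /dev/mapper | wc -l   # sanity-check the resulting DM node count
```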

Installation method

kickstart.sh

System info

Linux home-one 5.18.1-ahferroin7+ #256 SMP Thu Jul 21 12:54:39 EDT 2022 x86_64 AMD Ryzen 9 3950X 16-Core Processor AuthenticAMD GNU/Linux
/etc/gentoo-release:Gentoo Base System release 2.8
/etc/lsb-release:DISTRIB_ID="Gentoo"
/etc/os-release:NAME=Gentoo
/etc/os-release:ID=gentoo
/etc/os-release:PRETTY_NAME="Gentoo Linux"
/etc/os-release:ANSI_COLOR="1;32"
/etc/os-release:VERSION_ID="2.8"

Netdata build info

Version: netdata v1.35.0-230-gd917f9831
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: manual-static
    Binary architecture: x86_64
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Additional info

I suspect this is a combination of two issues: one in the parsing code for /proc/diskstats (not particularly bad on its own, but enough to have a measurable impact), and one in the dbengine query code.
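
One low-effort way to separate the two suspected costs would be to profile the daemon while a dashboard load is in progress; a generic sketch, assuming `perf` is installed and the main daemon's process name is `netdata`:

```sh
# Sample where the agent spends CPU time while the dashboard loads.
# pidof -s returns a single PID; if several netdata processes are
# running, pick the main daemon's PID manually instead.
sudo perf top -p "$(pidof -s netdata)"
```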

I can provide more detailed information as needed.

@Ferroin added the bug, needs triage labels on Jul 30, 2022
@Ferroin
Member Author

Ferroin commented Jul 31, 2022

On further inspection, it appears this is an issue with the dashboard code specifically, albeit probably at least partially caused by the design of the REST API.

Manually querying /api/v1/charts on the system in question gets a complete response in less than one second. Trying to load that endpoint with the developer console open in Chrome actually crashes the developer console due to running out of memory.

This leads me to believe that the dashboard code itself is struggling to process all of this data (it's about 20 MB of nicely formatted JSON).
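
For reference, the server-side numbers above are easy to reproduce without a browser; a minimal check against a local agent, assuming the default port of 19999:

```sh
# Time the full /api/v1/charts response and report the payload size.
curl -s -o /dev/null \
  -w 'total: %{time_total}s, size: %{size_download} bytes\n' \
  http://localhost:19999/api/v1/charts
```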


Given this, I'm inclined to suggest we need to rethink the handling of this endpoint for the upcoming v2 REST API. I would suggest adding a variant of this endpoint that returns exactly the data the dashboard needs to populate the navigation menu, leaving things like lists of dimensions, which are only needed when actually rendering a chart, to a secondary endpoint that can be queried per chart.
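
If I'm remembering the v1 API correctly, a per-chart metadata endpoint already exists as `/api/v1/chart`, so this proposal is mostly about slimming down the bulk `/api/v1/charts` response rather than adding new query machinery (the chart name below is illustrative):

```sh
# Fetch metadata (including dimensions) for a single chart on demand.
curl -s 'http://localhost:19999/api/v1/chart?chart=system.cpu'
```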

@ilyam8 transferred this issue from netdata/netdata on Aug 4, 2022
@thiagoftsm

Hello @Ferroin,

I can confirm the issue is related to dbengine: when I was working on the integration between eBPF and cgroups, I also needed minutes to load the dashboard once I had more than 250 containers and 21,000 metrics.

Best regards!
