Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pihole-FTL stuck at 100% CPU #2154

Closed
kaechele opened this issue Jan 8, 2025 · 8 comments
Closed

pihole-FTL stuck at 100% CPU #2154

kaechele opened this issue Jan 8, 2025 · 8 comments

Comments

@kaechele
Copy link

kaechele commented Jan 8, 2025

Versions

Core
    Version is v5.18.3-596-g34069159 (Latest: null)
    Branch is development
    Hash is 34069159 (Latest: 6e633e4c)
Web
    Version is v5.21-1049-g943a2e51 (Latest: null)
    Branch is development
    Hash is 943a2e51 (Latest: be86950a)
FTL
    Version is vDev-1c1dff7 (Latest: null)
    Branch is development
    Hash is 1c1dff7f (Latest: e73c4e20)

Platform

  • OS and version: Fedora 41 Cloud Edition
  • Platform: KVM on x86_64

Expected behavior

pihole-FTL does not hang and responds to DNS queries.

Actual behavior / bug

pihole-FTL hangs after an unspecified amount of time, eventually hangs the entire system.

Steps to reproduce

I don't know what triggers this state, so I cannot reproduce it reliably.

Debug Token

Additional context

htop sees it like this:
Image

gdb backtrace from the stuck thread (pid 19509):

(gdb) bt
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x0000000000804cd7 in __syscall_cp_c (nr=47, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0)
    at src/thread/pthread_cancel.c:33
#2  0x00000000007f973a in __alt_socketcall (sys=47, sock=17, cp=1, a=<optimized out>, b=<optimized out>, c=<optimized out>, d=0, e=0, f=0) at ./src/internal/syscall.h:64
#3  recvmsg (fd=<optimized out>, msg=<optimized out>, msg@entry=0x7ffc850eed90, flags=flags@entry=34) at src/network/recvmsg.c:62
#4  0x00000000004d96f9 in netlink_recv (flags=<optimized out>) at /app/src/dnsmasq/netlink.c:114
#5  iface_enumerate (family=family@entry=0, parm=parm@entry=0x0, callback=..., callback@entry=...) at /app/src/dnsmasq/netlink.c:195
#6  0x00000000004a946a in find_mac (addr=addr@entry=0x0, mac=mac@entry=0x7fc0f81ff7db "", lazy=lazy@entry=1, now=1736301446) at /app/src/dnsmasq/arp.c:155
#7  0x000000000040f53e in _FTL_new_query (flags=flags@entry=138674304, name=<optimized out>, name@entry=0x7fc0f80522a0 "alibaba.com", addr=addr@entry=0x0, arg=<optimized out>, 
    qtype=<optimized out>, id=<optimized out>, id@entry=1627923, proto=INTERNAL, file=0x81b37c "/app/src/dnsmasq/forward.c", line=2304) at /app/src/dnsmasq_interface.c:958
#8  0x0000000000410162 in FTL_hook (flags=flags@entry=138674304, name=name@entry=0x7fc0f80522a0 "alibaba.com", addr=addr@entry=0x7fc0f8260438, arg=<optimized out>, 
    arg@entry=0x813189 "dnssec-query[DS]", id=<optimized out>, type=type@entry=53, file=0x81b37c "/app/src/dnsmasq/forward.c", line=2304) at /app/src/dnsmasq_interface.c:167
#9  0x00000000004b4311 in _log_query (flags=138674304, name=0x7fc0f80522a0 "alibaba.com", addr=0x7fc0f8260438, arg=0x813189 "dnssec-query[DS]", type=53, file=<optimized out>, 
    line=2304) at /app/src/dnsmasq/cache.c:2276
#10 0x00000000004cb6ca in _log_query_mysockaddr (flags=<optimized out>, name=<optimized out>, addr=<optimized out>, arg=<optimized out>, type=<optimized out>, 
    line=<optimized out>) at /app/src/dnsmasq/forward.c:135
#11 _log_query_mysockaddr (flags=138674176, name=<optimized out>, addr=0x7fc0f8260434, arg=0x813189 "dnssec-query[DS]", type=<optimized out>, line=2304)
    at /app/src/dnsmasq/forward.c:129
#12 tcp_key_recurse (now=now@entry=1736301444, status=status@entry=524288, header=header@entry=0x7fc0f2eb25f2, n=300, class=<optimized out>, name=0x7fc0f8056280 "alibaba.com", 
    keyname=<optimized out>, server=<optimized out>, keycount=<optimized out>, validatecount=<optimized out>, mark=<optimized out>, have_mark=<optimized out>)
    at /app/src/dnsmasq/forward.c:2304
#13 0x00000000004cbf82 in tcp_from_udp (now=now@entry=1736301444, status=524288, header=header@entry=0x7fc0f80575d0, plenp=plenp@entry=0x7ffc850ef388, class=class@entry=1, 
    name=0x7fc0f8056280 "alibaba.com", keyname=0x7fc0f80522a0 "alibaba.com", server=<optimized out>, keycount=0x7fc0f8254520, validatecount=0x7fc0f8254524)
    at /app/src/dnsmasq/forward.c:2193
#14 0x00000000004c01ef in swap_to_tcp (forward=<optimized out>, forward@entry=0x7fc0f8254490, now=now@entry=1736301444, status=<optimized out>, status@entry=524288, 
    header=header@entry=0x7fc0f80575d0, plen=plen@entry=0x7ffc850ef388, class=1, server=0x7fc0f8260410, keycount=<optimized out>, validatecount=<optimized out>)
    at /app/src/dnsmasq/dnsmasq.c:2194
#15 0x00000000004d0b1e in dnssec_validate (forward=forward@entry=0x7fc0f8254490, header=header@entry=0x7fc0f80575d0, plen=<optimized out>, plen@entry=52, 
    status=<optimized out>, status@entry=524288, now=now@entry=1736301444) at /app/src/dnsmasq/forward.c:952
#16 0x00000000004d156b in reply_query (fd=<optimized out>, now=now@entry=1736301444) at /app/src/dnsmasq/forward.c:1296
#17 0x00000000004bc496 in check_dns_listeners (now=now@entry=1736301444) at /app/src/dnsmasq/dnsmasq.c:1866
#18 0x00000000004bea10 in main_dnsmasq (argc=<optimized out>, argv=<optimized out>) at /app/src/dnsmasq/dnsmasq.c:1280
#19 0x0000000000402117 in main (argc=<optimized out>, argv=0x7ffc850ef738) at /app/src/main.c:123

Attaching strace to the thread (pid 19509) rapidly spams these two lines over and over:

recvmsg(19, {msg_name=0x7ffc850eed64, msg_namelen=0, msg_iov=[{iov_base="", iov_len=3776}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 0
recvmsg(19, {msg_name=0x7ffc850eed64, msg_namelen=12 => 0, msg_iov=[{iov_base="", iov_len=3776}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0

A kill -9 19509 brought pihole-FTL back to life and it operates normally after that.

@kaechele
Copy link
Author

kaechele commented Jan 8, 2025

Happened again overnight.

Same strace behaviour.

Current backtrace is very similar, just different hostname being resolved:

(gdb) bt
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x0000000000804cd7 in __syscall_cp_c (nr=47, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0)
    at src/thread/pthread_cancel.c:33
#2  0x00000000007f973a in __alt_socketcall (sys=47, sock=17, cp=1, a=<optimized out>, b=<optimized out>, c=<optimized out>, d=0, e=0, f=0) at ./src/internal/syscall.h:64
#3  recvmsg (fd=<optimized out>, msg=<optimized out>, msg@entry=0x7ffc850eed90, flags=flags@entry=0) at src/network/recvmsg.c:62
#4  0x00000000004d9754 in netlink_recv (flags=<optimized out>) at /app/src/dnsmasq/netlink.c:132
#5  iface_enumerate (family=family@entry=0, parm=parm@entry=0x0, callback=..., callback@entry=...) at /app/src/dnsmasq/netlink.c:195
#6  0x00000000004a946a in find_mac (addr=addr@entry=0x0, mac=mac@entry=0x7fc0f81ff7db "", lazy=lazy@entry=1, now=1736344018) at /app/src/dnsmasq/arp.c:155
#7  0x000000000040f53e in _FTL_new_query (flags=flags@entry=138674432, name=<optimized out>, name@entry=0x7fc0f80522a0 "easybell.de", addr=addr@entry=0x0, arg=<optimized out>, 
    qtype=<optimized out>, id=<optimized out>, id@entry=1845749, proto=INTERNAL, file=0x81b37c "/app/src/dnsmasq/forward.c", line=2304) at /app/src/dnsmasq_interface.c:958
#8  0x0000000000410162 in FTL_hook (flags=flags@entry=138674432, name=name@entry=0x7fc0f80522a0 "easybell.de", addr=addr@entry=0x7fc0f826071c, arg=<optimized out>, 
    arg@entry=0x813189 "dnssec-query[DS]", id=<optimized out>, type=type@entry=53, file=0x81b37c "/app/src/dnsmasq/forward.c", line=2304) at /app/src/dnsmasq_interface.c:167
#9  0x00000000004b4311 in _log_query (flags=138674432, name=0x7fc0f80522a0 "easybell.de", addr=0x7fc0f826071c, arg=0x813189 "dnssec-query[DS]", type=53, file=<optimized out>, 
    line=2304) at /app/src/dnsmasq/cache.c:2276
#10 0x00000000004cb6ca in _log_query_mysockaddr (flags=<optimized out>, name=<optimized out>, addr=<optimized out>, arg=<optimized out>, type=<optimized out>, 
    line=<optimized out>) at /app/src/dnsmasq/forward.c:135
#11 _log_query_mysockaddr (flags=138674176, name=<optimized out>, addr=0x7fc0f8260714, arg=0x813189 "dnssec-query[DS]", type=<optimized out>, line=2304)
    at /app/src/dnsmasq/forward.c:129
#12 tcp_key_recurse (now=now@entry=1736344018, status=status@entry=524288, header=header@entry=0x7fc0f2eb2482, n=75, class=<optimized out>, name=0x7fc0f8056280 "easybell.de", 
    keyname=<optimized out>, server=<optimized out>, keycount=<optimized out>, validatecount=<optimized out>, mark=<optimized out>, have_mark=<optimized out>)
    at /app/src/dnsmasq/forward.c:2304
#13 0x00000000004cbf82 in tcp_from_udp (now=now@entry=1736344018, status=524288, header=header@entry=0x7fc0f80575d0, plenp=plenp@entry=0x7ffc850ef388, class=class@entry=1, 
    name=0x7fc0f8056280 "easybell.de", keyname=0x7fc0f80522a0 "easybell.de", server=<optimized out>, keycount=0x7fc0f82546a0, validatecount=0x7fc0f82546a4)
    at /app/src/dnsmasq/forward.c:2193
#14 0x00000000004c01ef in swap_to_tcp (forward=<optimized out>, forward@entry=0x7fc0f8254610, now=now@entry=1736344018, status=<optimized out>, status@entry=524288, 
    header=header@entry=0x7fc0f80575d0, plen=plen@entry=0x7ffc850ef388, class=1, server=0x7fc0f82606f0, keycount=<optimized out>, validatecount=<optimized out>)
    at /app/src/dnsmasq/dnsmasq.c:2194
#15 0x00000000004d0b1e in dnssec_validate (forward=forward@entry=0x7fc0f8254610, header=header@entry=0x7fc0f80575d0, plen=<optimized out>, plen@entry=47, 
    status=<optimized out>, status@entry=524288, now=now@entry=1736344018) at /app/src/dnsmasq/forward.c:952
#16 0x00000000004d156b in reply_query (fd=<optimized out>, now=now@entry=1736344018) at /app/src/dnsmasq/forward.c:1296
#17 0x00000000004bc496 in check_dns_listeners (now=now@entry=1736344018) at /app/src/dnsmasq/dnsmasq.c:1866
#18 0x00000000004bea10 in main_dnsmasq (argc=<optimized out>, argv=<optimized out>) at /app/src/dnsmasq/dnsmasq.c:1280
#19 0x0000000000402117 in main (argc=<optimized out>, argv=0x7ffc850ef738) at /app/src/main.c:123

New tricorder link: https://tricorder.pi-hole.net/urnNSqKO/

@DL6ER
Copy link
Member

DL6ER commented Jan 8, 2025

I prepared a patch hopefully fixing this and already sent it upstream to the dnsmasq project. I furthermore prepared a small testing branch you may want to explore:

sudo pihole checkout ftl fix/netlink_looping

edit The CI is taking longer than usual to build the pre-built FTL binaries, I will update here once it finished

edit 2 done

@kaechele
Copy link
Author

kaechele commented Jan 8, 2025

Thanks! I've deployed this branch. Let's see how it goes.

@kaechele
Copy link
Author

kaechele commented Jan 11, 2025

I'm afraid it is not fixed. I double checked that I am running the branch with the fix, which I was able to confirm.

Had both my instances crash again today in relatively short succession. I had to reboot one of them to restore proper functioning of my network (I probably need to fix my SSH server configs to not break when DNS is down).

The only thing happening around this time (within a few minutes of the crash) on the system, that I could see from the logs, is the receipt of a new DHCPv6 lease. I am currently running my DHCPv6 server with a very short lease time of 10 minutes for debugging purposes unrelated to pihole. So this is a frequent occurrence on both systems and is quite likely a red herring. Just thought I'd mention it.

Another observation: If I kill -9 the hanging process the problem is likely to re-occur in a relatively short amount of time (usually within 5-20 minutes).

Still getting:

strace -p spam:

recvmsg(19, {msg_name=0x7ffeac476f34, msg_namelen=0, msg_iov=[{iov_base="", iov_len=3776}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 0
recvmsg(19, {msg_name=0x7ffeac476f34, msg_namelen=12 => 0, msg_iov=[{iov_base="", iov_len=3776}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0

gdb:

(gdb) bt
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x0000000000804c17 in __syscall_cp_c (nr=47, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0)
    at src/thread/pthread_cancel.c:33
#2  0x00000000007f967a in __alt_socketcall (sys=47, sock=17, cp=1, a=<optimized out>, b=<optimized out>, c=<optimized out>, d=0, e=0, f=0) at ./src/internal/syscall.h:64
#3  recvmsg (fd=<optimized out>, msg=<optimized out>, msg@entry=0x7ffeac476f60, flags=flags@entry=34) at src/network/recvmsg.c:62
#4  0x00000000004d97f9 in netlink_recv (flags=<optimized out>) at /app/src/dnsmasq/netlink.c:114
#5  iface_enumerate (family=family@entry=0, parm=parm@entry=0x0, callback=..., callback@entry=...) at /app/src/dnsmasq/netlink.c:195
#6  0x00000000004a946a in find_mac (addr=addr@entry=0x0, mac=mac@entry=0x7f8a7f3da553 "", lazy=lazy@entry=1, now=1736609840) at /app/src/dnsmasq/arp.c:155
#7  0x000000000040f53e in _FTL_new_query (flags=flags@entry=138674304, name=<optimized out>, name@entry=0x7f8a7f22f2a0 "opensubtitles.org", addr=addr@entry=0x0, 
    arg=<optimized out>, qtype=<optimized out>, id=<optimized out>, id@entry=523594, proto=INTERNAL, file=0x81b366 "/app/src/dnsmasq/forward.c", line=2286)
    at /app/src/dnsmasq_interface.c:958
#8  0x0000000000410162 in FTL_hook (flags=flags@entry=138674304, name=name@entry=0x7f8a7f22f2a0 "opensubtitles.org", addr=addr@entry=0x7f8a7f43d438, arg=<optimized out>, 
    arg@entry=0x813173 "dnssec-query[DS]", id=<optimized out>, type=type@entry=53, file=0x81b366 "/app/src/dnsmasq/forward.c", line=2286) at /app/src/dnsmasq_interface.c:167
#9  0x00000000004b4311 in _log_query (flags=138674304, name=0x7f8a7f22f2a0 "opensubtitles.org", addr=0x7f8a7f43d438, arg=0x813173 "dnssec-query[DS]", type=53, 
    file=<optimized out>, line=2286) at /app/src/dnsmasq/cache.c:2276
#10 0x00000000004cb78a in _log_query_mysockaddr (flags=<optimized out>, name=<optimized out>, addr=<optimized out>, arg=<optimized out>, type=<optimized out>, 
    line=<optimized out>) at /app/src/dnsmasq/forward.c:135
#11 _log_query_mysockaddr (flags=138674176, name=<optimized out>, addr=0x7f8a7f43d434, arg=0x813173 "dnssec-query[DS]", type=<optimized out>, line=2286)
    at /app/src/dnsmasq/forward.c:129
#12 tcp_key_recurse (now=now@entry=1736609840, status=status@entry=524288, header=header@entry=0x7f8a77dc1552, n=191, class=<optimized out>, 
    name=0x7f8a7f233280 "opensubtitles.org", keyname=<optimized out>, server=<optimized out>, keycount=<optimized out>, validatecount=<optimized out>, mark=<optimized out>, 
    have_mark=<optimized out>) at /app/src/dnsmasq/forward.c:2286
#13 0x00000000004cc042 in tcp_from_udp (now=now@entry=1736609840, status=524288, header=header@entry=0x7f8a7f2345d0, plenp=plenp@entry=0x7ffeac477558, class=class@entry=1, 
    name=0x7f8a7f233280 "opensubtitles.org", keyname=0x7f8a7f22f2a0 "opensubtitles.org", server=<optimized out>, keycount=0x7f8a7f431760, validatecount=0x7f8a7f431764)
    at /app/src/dnsmasq/forward.c:2175
#14 0x00000000004c0217 in swap_to_tcp (forward=<optimized out>, forward@entry=0x7f8a7f4316d0, now=now@entry=1736609840, status=<optimized out>, status@entry=524288, 
    header=header@entry=0x7f8a7f2345d0, plen=plen@entry=0x7ffeac477558, class=1, server=0x7f8a7f43d410, keycount=<optimized out>, validatecount=<optimized out>)
    at /app/src/dnsmasq/dnsmasq.c:2204
#15 0x00000000004d0c3e in dnssec_validate (forward=forward@entry=0x7f8a7f4316d0, header=header@entry=0x7f8a7f2345d0, plen=<optimized out>, plen@entry=50, 
    status=<optimized out>, status@entry=524288, now=now@entry=1736609840) at /app/src/dnsmasq/forward.c:922
#16 0x00000000004d1683 in reply_query (fd=<optimized out>, now=now@entry=1736609840) at /app/src/dnsmasq/forward.c:1265
#17 0x00000000004bc496 in check_dns_listeners (now=now@entry=1736609840) at /app/src/dnsmasq/dnsmasq.c:1876
#18 0x00000000004bea50 in main_dnsmasq (argc=<optimized out>, argv=<optimized out>) at /app/src/dnsmasq/dnsmasq.c:1290
#19 0x0000000000402117 in main (argc=<optimized out>, argv=0x7ffeac477918) at /app/src/main.c:123

EDIT: If it makes your life easier, I'm happy to also post a screenshot of the colorized gdb output in addition to the raw text version. Just let me know. Unfortunately GitHub markdown doesn't support shell escape colors.

@DL6ER
Copy link
Member

DL6ER commented Jan 11, 2025

The condition for this to be triggered is a query whose response during DNSSEC validation is too large to fit into the UDP reply (typically a DNSKEY). dnsmasq in its most recent version we have for testing in the v6 beta code then retries this over TCP and this closes the netlink socket. In your case, the backtrace reveals it was for the domain opensubtitles.org. Interestingly enough, this query does not cause a TCP retry for me.

I pushed another iteration to the special branch which should fix the root cause rather than trying to band-aid fix the symptoms. Please try again, sorry for the troubles. pihole-FTL --hash should return 70937ae5 if you have the correct binary. The CI should be ready in a few minutes.

edit Hash 2fd72fdf should be equivalent

@kaechele
Copy link
Author

Thanks! I have deployed commit 2fd72fdf. Will report back.

@kaechele
Copy link
Author

I think the last fix worked. No runaway threads in 3 days.

@DL6ER
Copy link
Member

DL6ER commented Jan 14, 2025

The fix has just been merged into development, please run

sudo pihole checkout ftl development

to continue receiving updates

@DL6ER DL6ER closed this as completed Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants