NGINX ingress creating endless core dumps #7080

Closed
AmitBenAmi opened this issue Apr 27, 2021 · 33 comments · Fixed by #7777
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@AmitBenAmi

NGINX Ingress controller version: v0.41.2

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS EKS

What happened:
My NGINX ingress controllers started creating endless core dump files.
These filled up some of my nodes' filesystems, creating disk pressure and causing other pods to be evicted.
I do not have debug logging enabled, nor have I intentionally configured NGINX to create core dumps.

What you expected to happen:
I'm not sure whether preventing core dumps is the right approach; gdb output is at the bottom.

How to reproduce it: I don't understand why it happens now. We do have autoscaling enabled and I don't think we reach the resource limits, so I'm not sure what triggers it.

Anything else we need to know:
I managed to copy the core dump and tried to investigate it, but couldn't find anything verbose in it:

GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nginx...done.
[New LWP 4969]
[New LWP 4981]
[New LWP 4971]
[New LWP 4979]
[New LWP 4974]
[New LWP 4982]
[New LWP 4983]
[New LWP 4977]
[New LWP 4970]
[New LWP 4972]
[New LWP 4988]
[New LWP 4986]
[New LWP 4973]
[New LWP 4980]
[New LWP 4978]
[New LWP 4976]
[New LWP 4975]
[New LWP 4984]
[New LWP 4989]
[New LWP 4987]
[New LWP 4985]
[New LWP 4990]
[New LWP 5001]
[New LWP 4994]
[New LWP 4995]
[New LWP 4991]
[New LWP 5000]
[New LWP 4992]
[New LWP 4999]
[New LWP 4996]
[New LWP 4993]
[New LWP 4997]
[New LWP 4998]

warning: Unexpected size of section `.reg-xstate/4969' in core file.

warning: Can't read pathname for load map: No error information.
Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/4969' in core file.
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
[Current thread is 1 (LWP 4969)]
(gdb) backtrace
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
#1  0x00007fd38a72be21 in ?? () from /lib/libcrypto.so.1.1
#2  0x00007fd38a72bf24 in ASN1_item_free () from /lib/libcrypto.so.1.1
#3  0x00007fd38a94e62b in SSL_SESSION_free () from /lib/libssl.so.1.1
#4  0x00007fd38a7e2cdc in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#5  0x00007fd38a94f76c in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#6  0x00007fd38a965896 in ?? () from /lib/libssl.so.1.1
#7  0x00007fd38a959f48 in ?? () from /lib/libssl.so.1.1
#8  0x00007fd38a948ec2 in SSL_do_handshake () from /lib/libssl.so.1.1
#9  0x000055614f8c0174 in ngx_ssl_handshake (c=c@entry=0x7fd38a2c4418) at src/event/ngx_event_openssl.c:1694
#10 0x000055614f8c058d in ngx_ssl_handshake_handler (ev=0x7fd38a0ebc40) at src/event/ngx_event_openssl.c:2061
#11 0x000055614f8bac1f in ngx_epoll_process_events (cycle=0x55615199b2f0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x000055614f8adc62 in ngx_process_events_and_timers (cycle=cycle@entry=0x55615199b2f0) at src/event/ngx_event.c:257
#13 0x000055614f8b82fc in ngx_worker_process_cycle (cycle=0x55615199b2f0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:774
#14 0x000055614f8b6233 in ngx_spawn_process (cycle=cycle@entry=0x55615199b2f0, proc=0x55614f8b81d2 <ngx_worker_process_cycle>, data=0x0, name=0x55614f9dae3f "worker process", respawn=respawn@entry=0) at src/os/unix/ngx_process.c:199
#15 0x000055614f8b73aa in ngx_reap_children (cycle=cycle@entry=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:641
#16 0x000055614f8b9036 in ngx_master_process_cycle (cycle=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:174
#17 0x000055614f88ba00 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:385

In the meantime, I added a LimitRange with a default ephemeral-storage limit of 10Gi to keep pods from filling the node's storage (my pods reached ~60Gi of storage usage from core dumps alone).
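A minimal sketch of such a LimitRange (the namespace name and the request value here are illustrative, not what I necessarily used):

apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limit
  namespace: ingress-nginx        # illustrative namespace
spec:
  limits:
    - type: Container
      default:
        ephemeral-storage: 10Gi   # default limit applied to containers that don't set one
      defaultRequest:
        ephemeral-storage: 1Gi    # illustrative default request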

/kind bug

@AmitBenAmi AmitBenAmi added the kind/bug Categorizes issue or PR as related to a bug. label Apr 27, 2021
@tokers
Contributor

tokers commented Apr 28, 2021

It seems we don't have a mechanism to change the worker_rlimit_core directive. We may add this feature; what do you think?
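For reference, in a plain nginx.conf the relevant directives would look roughly like this (values are illustrative; this is not currently exposed by the controller):

# main (top-level) context of nginx.conf
worker_rlimit_core  2m;            # cap the size of core files written by worker processes
working_directory   /tmp/cores/;   # directory where core files would be written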

@AmitBenAmi
Author

That sounds good on its own; however, it won't make NGINX create fewer core dumps. If I restrict each dump to 2MB, NGINX can still create thousands of them and fill my filesystem (unless I set the limit to 0, meaning no core dumps are created at all, in which case I'm ignoring the problem rather than noticing it exists).

@tokers
Contributor

tokers commented Apr 28, 2021

That looks like an internal bug in OpenSSL; it's difficult to troubleshoot because the debug symbols were stripped. We may wait a while and see whether somebody reports a similar experience, which might be useful.

@longwuyuan
Contributor

Would you be interested in showing what your cluster looks like, with:

- cat /proc/cpuinfo ; cat /proc/meminfo # from nodes
- helm ls -A
- kubectl get all,nodes,ing -A -o wide
- kubectl -n ingresscontrollernamespace describe po ingresscontrollerpod
- Get the nginx.conf from inside the pod and paste it here
- CPU/Memory/Inodes/Disk related status from your monitoring

@AmitBenAmi
Author

@longwuyuan I don't want to expose that kind of information about my environment.
If there is something more specific I can maybe share it, but this is a lot of information.
I can say that my ingress pods didn't terminate; they only created a significant amount of core dumps.

@longwuyuan
Contributor

Maybe write very clear details about the hardware, software, and config, and the list of commands that someone can execute, for example on minikube, to reproduce this problem.

@AmitBenAmi
Author

I have no idea how to reproduce this.
My hardware is EKS (AWS EC2).
The NGINX Docker image is v0.41.2.

As for configuration, I have thousands of Ingresses that automatically populate nginx.conf with hundreds of location blocks and other nginx directives.

Any idea how I can export a full interpretation of the dump to maybe help understand the problem?

@longwuyuan
Contributor

Not every AWS EKS user is reporting the same behaviour. There was one other issue reporting core dumps, and the best thought on that one was to spread the load. Any chance the problem is caused by your use case only?

/remove-kind bug
/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 29, 2021
@AmitBenAmi
Author

I double-checked and the load isn't different or suddenly immense.
(screenshot: load metrics, 2021-04-29 10:52)

I guess it is probably an error with something in my environment and not necessarily a bug in NGINX, but my nginx.conf consists of thousands of lines. @longwuyuan, do you have any idea where I should look in the configuration itself?

@longwuyuan
Contributor

longwuyuan commented Apr 29, 2021

You could be hitting a limit or a memory violation; it's hard to tell which until the core backtrace is explicit.
Your earlier post shows '??' symbols in gdb, then crypto, then libssl. I am no developer so I can't help much, but as someone said elsewhere, '??' means you are missing symbols. The crypto/ssl frames could mean all your TLS config is coming into play and nginx could not handle the size, since, as you say, you have thousands of ingresses.

You could upgrade to the most recent release of the ingress controller, check how to run gdb for nginx core dumps (see the sketch below), and post another backtrace that shows the size or any other details of the data structure it is complaining about:

Unexpected size of section `.reg-xstate/4969' in core file

Also, you can try to replicate the same number of objects in another cluster while spreading the load.
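A rough sketch of getting a fuller backtrace from the core file (the binary path and core file name below are illustrative; adjust them to your image):

gdb /usr/local/nginx/sbin/nginx /tmp/core.nginx.12345
(gdb) set pagination off
(gdb) thread apply all bt full        # full backtrace with local variables for every thread
(gdb) info sharedlibrary              # which libraries were loaded and whether their symbols were read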

@mitom

mitom commented Jun 29, 2021

@tokers has the option to set worker_rlimit_core ever been added? We're now facing this issue and more or less know the root cause for us (it's a chain of user error in configuring a number of certificates, which cert-manager endlessly retries to validate via the HTTP solver but fails because they're not set up properly; this leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fill the disk, and in the end everything is dead).

I realise ignoring the core dumps is hiding the issue, but in our scenario that would be much preferred to taking out the entire ingress because of some misconfigured certs.
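One blunt way to suppress the dumps at the node level is something like the following sketch; this hides the problem rather than fixing it, and the sysctl is node-wide rather than per-pod:

# on the node itself (or via a privileged DaemonSet); affects every process on the node
sysctl -w kernel.core_pattern=/dev/null    # discard core files instead of writing them
# alternatively, drop the core size limit to 0 for the shell/process that starts nginx
ulimit -c 0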

@AmitBenAmi
Author

@mitom how did you determine that this chain is the root cause?
Is it something you found in the core dumps themselves?

@mitom

mitom commented Jun 29, 2021

No, the core dumps only contained:

#0  0x00007fa81c6d9c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00000000000000a0 in ?? ()
#2  0x00007fff2b051e20 in ?? ()
#3  0x00007fff2b051db0 in ?? ()
#4  0x0000000000000000 in ?? ()

which doesn't mean anything to me really.

It is more or less an educated guess, based on the fact that around the time we had this issue the controller logs were spammed with invalid-certificate errors.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@sepich
Contributor

sepich commented Sep 30, 2021

We have the same issue in the coredump (#6896 (comment)), even with a newer nginx and the Debian image.

more or less know the root cause for us (it's a chain of user error in configuring a number of certificates, which cert-manager endlessly retries to validate via the HTTP solver but fails because they're not set up properly; this leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fill the disk, and in the end everything is dead).

We also use cert-manager, but we have no errors or unvalidated certs. There are no errors in either the cert-manager logs or ingress-nginx, but a worker still dies with "worker process 931 exited on signal 11".
Another thing I've noticed is that nginx_ingress_controller_nginx_process_connections only grows and never decreases:
(screenshot: nginx_ingress_controller_nginx_process_connections graph)
Each of these small steps up is a worker-death event. So according to the nginx stats there should currently be about 30k active connections.
But if I log in to that exact pod, there are only ~2k:

$ k -n ingress-nginx exec -it ingress-nginx-controller-5cf78859f4-7l9cc -- bash
bash-5.1$ netstat -tn | wc -l
2351

We cannot submit this issue to nginx upstream, because ingress-nginx compiles nginx from source with additional plugins and patches. Also, I have pretty limited knowledge of gdb and debug symbols, so I was unable to find the symbols for libssl on either Alpine or Debian to resolve this part of the coredump:

#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1

Any help would be greatly appreciated.

@rikatz
Contributor

rikatz commented Sep 30, 2021

Hey @sepich, thanks. I will start digging into OpenSSL problems now, since we could rule out the OpenResty bug.

Are you using NGINX v1.0.2?

Can you provide some further information about the size of your environment, the number of Ingresses, and the number of different SSL certificates?

Thanks

@sepich
Contributor

sepich commented Sep 30, 2021

Are you using NGINX v1.0.2?

No, we are still on k8s 1.19 and therefore on ingress-nginx v0.49.2.

size of your environment

From #6896 (comment)

It is just 215 Ingress objects / 80 rps, 3 ingress-nginx pods with 5% cpu load each.

99% of the Ingresses use SSL, so I would say there are about 215 certs as well. This number is pretty stable; Ingresses are not created and deleted every 5 minutes, more like once per week.

@rikatz
Contributor

rikatz commented Sep 30, 2021

Ok, thanks! Will check ASAP :)

@rikatz
Contributor

rikatz commented Sep 30, 2021

I'm wondering if this patch (https://github.com/openresty/openresty/blob/master/patches/openssl-1.1.1f-sess_set_get_cb_yield.patch) which is applied by Openresty shouldn't be applied in OpenSSL as well.

@rikatz
Contributor

rikatz commented Oct 1, 2021

@sepich if I generate an image of 0.49.3 (to be released) with the OpenResty OpenSSL patch applied, would you be able to test it and provide some feedback?

@doujiang24

Hi @sepich, I have sent you an email to arrange a call with an interactive gdb session, as mentioned here:
#6896 (comment)

Thanks very much!

@sepich
Contributor

sepich commented Oct 1, 2021

@rikatz, great find!
This patch was originally created in two parts, for both nginx and openssl: openresty/openresty@97901f3

https://github.com/openresty/lua-resty-core/blob/master/lib/ngx/ssl/session.md#description

This Lua API can be used to implement distributed SSL session caching for downstream SSL connections, thus saving a lot of full SSL handshakes which are very expensive.

I've checked that ngx.ssl.session, ssl_session_fetch_by_lua* and ssl_session_store_by_lua* are not used in ingress-nginx. We also do not use any Lua code in ingress snippets. So I deleted the images/nginx/rootfs/patches/nginx-1.19.9-ssl_sess_cb_yield.patch file (to avoid rebuilding openssl), then rebuilt nginx and v0.49.2. But the issue and the coredump backtrace are the same:

#5  0x00007fdb5619efb4 in ?? () from /lib/libssl.so.1.1
#6  0x0000562d5b8b8c68 in ngx_ssl_handshake (c=c@entry=0x7fdb55a2fa20) at src/event/ngx_event_openssl.c:1720
#7  0x0000562d5b8b9081 in ngx_ssl_handshake_handler (ev=0x7fdb5588a0c0) at src/event/ngx_event_openssl.c:2069

But there is one more patch for ngx_event_openssl.c, nginx-1.19.9-ssl_cert_cb_yield.patch:
https://github.com/openresty/lua-nginx-module#ssl_certificate_by_lua_block
I checked that the ingress-nginx Lua code does not use this either, and rebuilt the image without this patch too, but the issue still remains.
It looks like I misunderstood something; maybe you can build a test image with only the minimum set of patches needed to make the ingress-nginx controller work?

@doujiang24, got it!

@rikatz
Contributor

rikatz commented Oct 1, 2021 via email

@rikatz
Contributor

rikatz commented Oct 1, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2021
@sepich
Contributor

sepich commented Oct 1, 2021

you can build your own controller using it

Thank you; unfortunately it still fails (v0.49.2 on top of it):

Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
[Current thread is 1 (LWP 69)]
(gdb) bt
#0  0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#1  0x00007f1b4819b86f in OPENSSL_LH_doall_arg () from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1
#2  0x00007f1b4834cdb7 in SSL_CTX_flush_sessions () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#3  0x00007f1b48367d55 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#4  0x00007f1b4835923d in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#5  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#6  0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#7  0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#8  0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#9  0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#10 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#11 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#12 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#13 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

Is it now possible to load openssl debug symbols somehow?

@doujiang24

@sepich The debug symbol package for openresty-openssl111 is openresty-openssl111-dbg.
You can try to install it with apk add openresty-openssl111-dbg.

@sepich
Contributor

sepich commented Oct 1, 2021

Thanks, it is:

echo https://openresty.org/package/alpine/v3.14/main >> /etc/apk/repositories
apk --allow-untrusted add openresty-openssl111-dbg

seems to be working:

(gdb) bt
#0  0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
#1  0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
    at crypto/lhash/lhash.c:196
#2  OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
#3  0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
#4  SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
#5  0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
#6  0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
#7  0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
#8  state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
#9  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#11 0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#13 0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#14 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#15 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#16 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#17 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

(gdb) bt full
#0  0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
No locals.
#1  0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
    at crypto/lhash/lhash.c:196
        i = 1781
        a = <optimized out>
        n = 0x0
        i = <optimized out>
        a = <optimized out>
        n = <optimized out>
#2  OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
No locals.
#3  0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
No locals.
#4  SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
        i = 256
        tp = {ctx = 0x7f1b45a11390, time = 1633097512, cache = 0x7f1b45a1ab50}
#5  0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
        stat = <optimized out>
        i = <optimized out>
#6  0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
        tctx = <optimized out>
        tick_nonce = "\000\000\000\000\000\000\000\001"
        age_add_u = {age_add_c = "Ɩ\240", <incomplete sequence \350>, age_add = 3902838470}
        err = <optimized out>
#7  0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
        post_work = 0x7f1b48368c10 <ossl_statem_server_post_work>
        mt = 4
        pkt = {buf = 0x7f1b399499b0, staticbuf = 0x0, curr = 57, written = 57, maxsize = 18446744073709551615, subs = 0x7f1b371d0b20}
        ret = <optimized out>
        pre_work = 0x7f1b483689e0 <ossl_statem_server_pre_work>
        get_construct_message_f = 0x7f1b48368fd0 <ossl_statem_server_construct_message>
        confunc = 0x7f1b483675f0 <tls_construct_new_session_ticket>
        st = 0x7f1b391c39c8
        transition = 0x7f1b48368580 <ossl_statem_server_write_transition>
        cb = 0x561c111a4491 <ngx_ssl_info_callback>
        st = <optimized out>
        ret = <optimized out>
        transition = <optimized out>
        pre_work = <optimized out>
        post_work = <optimized out>
        get_construct_message_f = <optimized out>
        cb = <optimized out>
        confunc = <optimized out>
        mt = <optimized out>
        pkt = {buf = <optimized out>, staticbuf = <optimized out>, curr = <optimized out>, written = <optimized out>, maxsize = <optimized out>, subs = <optimized out>}
#8  state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
        buf = 0x0
        cb = 0x561c111a4491 <ngx_ssl_info_callback>
        st = <optimized out>
        ret = <optimized out>
        ssret = <optimized out>
#9  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
        n = <optimized out>
        sslerr = <optimized out>
        err = <optimized out>
        rc = <optimized out>
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
        c = 0x7f1b47b58b78

@doujiang24

@sepich Great, timeout_cb is the first key function we want to verify.

@sepich
Contributor

sepich commented Oct 4, 2021

@rikatz, could you please share how to build an image like rpkatz/nginx:patchedopenresty?
(@doujiang24 asked me to add some debugging to openssl and test.)

@rikatz
Contributor

rikatz commented Oct 4, 2021

#7732 This way :)

@sepich
Contributor

sepich commented Oct 7, 2021

While recompiling openssl I found a workaround for this issue: edit nginx.conf
ssl_session_cache builtin:1000 shared:SSL:10m;
and drop builtin:1000. From the docs:

builtin
a cache built in OpenSSL; Use of the built-in cache can cause memory fragmentation.

using only shared cache without the built-in cache should be more efficient.

Unfortunately this is not exposed via an annotation, so the template has to be edited. There is even a Stack Overflow article about this.
It would be interesting to know why builtin:1000 is hardcoded. I understand this is not a fix for the OpenSSL issue, but maybe drop builtin from the template for everybody, as the docs suggest?
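In terms of the generated nginx.conf, the workaround boils down to changing

ssl_session_cache builtin:1000 shared:SSL:10m;

to

ssl_session_cache shared:SSL:10m;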

@doujiang24

Hello @sepich,
Glad you found a way to avoid the segfault in #7777. But I think it is a workaround rather than a proper fix.
According to the Nginx docs, using "builtin" and "shared" at the same time should be supported:
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache

Unfortunately, after talking to the OpenSSL and Nginx teams, I still cannot find where the bug is.
openssl/openssl#16733 (comment)
http://mailman.nginx.org/pipermail/nginx-devel/2021-October/014372.html

Hello @rikatz,
Maybe you can help by exposing an annotation to enable "builtin", disabled by default?
That way someone could try to reproduce it if they are still interested in fixing the bug.
I'm not sure it is worth it, though; using "builtin" together with "shared" may not be a good choice usually.

@rikatz
Contributor

rikatz commented Oct 10, 2021

yeah sure, I will open a new PR and add that as a configuration :)
