Occasional panic and crash from AddrNotAvailable #8

ttencate · 2018-11-04T11:40:34Z

I'm seeing this about once a week (but at unpredictable moments) running journaldriver on a GCE instance:

thread 'main' panicked at 'log receiver encountered an unexpected error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
stack backtrace:
   0:     0x5638719d00ec - <no info>
   1:     0x5638719cf752 - <no info>
   2:     0x5638719cf30f - <no info>
   3:     0x5638719cede1 - <no info>
   4:     0x5638719cee38 - <no info>
   5:     0x563871841da0 - <no info>
   6:     0x563871836cd2 - <no info>
   7:     0x5638719ee632 - <no info>
   8:     0x5638719fa339 - <no info>
   9:     0x5638719eef4a - <no info>
  10:     0x563871843213 - <no info>
  11:     0x7f1494e762b0 - <no info>
  12:     0x563871820c5d - <no info>
  13:                0x0 - <no info>', libcore/result.rs:945:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
   1: std::sys_common::backtrace::_print
   2: std::panicking::default_hook::{{closure}}
   3: std::panicking::default_hook
   4: std::panicking::rust_panic_with_hook
   5: std::panicking::continue_panic_fmt
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::result::unwrap_failed
   9: journaldriver::main
  10: std::rt::lang_start::{{closure}}
  11: std::panicking::try::do_call
  12: __rust_maybe_catch_panic
  13: std::rt::lang_start_internal
  14: main
  15: __libc_start_main
  16: _start
journaldriver.service: Main process exited, code=exited, status=101/n/a

I used RUST_BACKTRACE=1 in production to capture this trace, but I don't have a reliable way to reproduce the crash on demand.

According to the docs, AddrNotAvailable means "A nonexistent interface was requested or the requested address was not local."; not very descriptive. I suppose it corresponds to POSIX's EADDRNOTAVAIL, about which we can find a lot more information, e.g. from connect(2):

EADDRNOTAVAIL

(Internet domain sockets) The socket referred to by sockfd had not previously been bound to an
address and, upon attempting to bind it to an ephemeral port, it was determined that all port num‐
bers in the ephemeral port range are currently in use. See the discussion of
/proc/sys/net/ipv4/ip_local_port_range in ip(7).

The crash sometimes happens during a nightly cron job that causes some heavy network traffic; it mirrors the contents of an HTTP server onto Google Cloud Storage. But journaldriver often survives this job, and it sometimes crashes at other moments of the day as well.

The cron job isn't parallel, but it does make a lot of HTTP requests in short succession. (The Python requests package should use keep-alive but I'm not sure whether the Google Cloud Storage client library also does that. Would be silly if it didn't!) If each of those connections results in a socket lingering in TIME_WAIT state, I can imagine running out of ephemeral ports. If that hypothesis is correct, this ServerFault answer has some things I could try, in particular setting net.ipv4.tcp_tw_reuse=1.

But whatever the root cause is, I think journaldriver should not crash, but rather just log the error (probably at "notice" or "warning" level) and retry later.

The text was updated successfully, but these errors were encountered:

ttencate · 2018-11-08T08:07:26Z

Going to deploy a debug build now, so we'll hopefully get a proper backtrace for the actual panic as well.

ttencate · 2018-11-09T07:42:17Z

Proper backtrace for the actual panic:

thread 'main' panicked at 'log receiver encountered an unexpected error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
stack backtrace:
   0:     0x55cebfecadf5 - backtrace::backtrace::libunwind::trace::hddcf23dceed63861
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace::hc0889779a13962b6
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/mod.rs:42
   1:     0x55cebfec557e - backtrace::capture::Backtrace::new_unresolved::h135cd2dabc77d958
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/capture.rs:88
   2:     0x55cebfec3d63 - failure::backtrace::internal::InternalBacktrace::new::hf969fe947675c857
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/backtrace/internal.rs:44
   3:     0x55cebfec2cec - failure::backtrace::Backtrace::new::h4fc9ca820767cd90
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/backtrace/mod.rs:111
   4:     0x55cebfb382fd - <failure::error::error_impl::ErrorImpl as core::convert::From<F>>::from::h27c39d1f605833b1
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/error/error_impl.rs:20
   5:     0x55cebfb420ae - <failure::error::Error as core::convert::From<F>>::from::hcdf7ee17e121cfda
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/error/mod.rs:36
   6:     0x55cebfb33d03 - journaldriver::receiver_loop::h6cd209a6724f67b9
                        at src/main.rs:492
   7:     0x55cebfb37520 - journaldriver::main::h072e367093bba6a9
                        at src/main.rs:649
   8:     0x55cebfb1a9ef - std::rt::lang_start::{{closure}}::hcd7f7fa1470e80a0
                        at /build/rust/src/rustc-1.30.0-src/src/libstd/rt.rs:74
   9:     0x55cebfeff4b2 - std::panicking::try::do_call::h492948eea006f3bc
  10:     0x55cebff0af29 - __rust_maybe_catch_panic
  11:     0x55cebfeffe3a - std::rt::lang_start_internal::h7206055d3d0638b8
  12:     0x55cebfb1a9c7 - std::rt::lang_start::h1aa454bce21ee725
                        at /build/rust/src/rustc-1.30.0-src/src/libstd/rt.rs:74
  13:     0x55cebfb37a89 - main
  14:     0x7f1aeb9212b0 - __libc_start_main
  15:     0x55cebfb16c5d - _start
  16:                0x0 - <unknown>', libcore/result.rs:1009:5

tazjin · 2019-02-04T14:31:43Z

Totally missed this issue somehow, sorry!

But whatever the root cause is, I think journaldriver should not crash, but rather just log the error (probably at "notice" or "warning" level) and retry later.

I think I agree! Going to take a look at this.

tazjin added the bug Something isn't working label Feb 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Occasional panic and crash from AddrNotAvailable #8

Occasional panic and crash from AddrNotAvailable #8

ttencate commented Nov 4, 2018 •

edited

Loading

ttencate commented Nov 8, 2018

ttencate commented Nov 9, 2018 •

edited

Loading

tazjin commented Feb 4, 2019

Occasional panic and crash from AddrNotAvailable #8

Occasional panic and crash from AddrNotAvailable #8

Comments

ttencate commented Nov 4, 2018 • edited Loading

ttencate commented Nov 8, 2018

ttencate commented Nov 9, 2018 • edited Loading

tazjin commented Feb 4, 2019

ttencate commented Nov 4, 2018 •

edited

Loading

ttencate commented Nov 9, 2018 •

edited

Loading