Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional panic and crash from AddrNotAvailable #8

Open
ttencate opened this issue Nov 4, 2018 · 3 comments
Open

Occasional panic and crash from AddrNotAvailable #8

ttencate opened this issue Nov 4, 2018 · 3 comments
Labels
bug Something isn't working

Comments

@ttencate
Copy link

ttencate commented Nov 4, 2018

I'm seeing this about once a week (but at unpredictable moments) running journaldriver on a GCE instance:

thread 'main' panicked at 'log receiver encountered an unexpected error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
stack backtrace:
   0:     0x5638719d00ec - <no info>
   1:     0x5638719cf752 - <no info>
   2:     0x5638719cf30f - <no info>
   3:     0x5638719cede1 - <no info>
   4:     0x5638719cee38 - <no info>
   5:     0x563871841da0 - <no info>
   6:     0x563871836cd2 - <no info>
   7:     0x5638719ee632 - <no info>
   8:     0x5638719fa339 - <no info>
   9:     0x5638719eef4a - <no info>
  10:     0x563871843213 - <no info>
  11:     0x7f1494e762b0 - <no info>
  12:     0x563871820c5d - <no info>
  13:                0x0 - <no info>', libcore/result.rs:945:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
   1: std::sys_common::backtrace::_print
   2: std::panicking::default_hook::{{closure}}
   3: std::panicking::default_hook
   4: std::panicking::rust_panic_with_hook
   5: std::panicking::continue_panic_fmt
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::result::unwrap_failed
   9: journaldriver::main
  10: std::rt::lang_start::{{closure}}
  11: std::panicking::try::do_call
  12: __rust_maybe_catch_panic
  13: std::rt::lang_start_internal
  14: main
  15: __libc_start_main
  16: _start
journaldriver.service: Main process exited, code=exited, status=101/n/a

I used RUST_BACKTRACE=1 in production to capture this trace, but I don't have a reliable way to reproduce the crash on demand.

According to the docs, AddrNotAvailable means "A nonexistent interface was requested or the requested address was not local."; not very descriptive. I suppose it corresponds to POSIX's EADDRNOTAVAIL, about which we can find a lot more information, e.g. from connect(2):

EADDRNOTAVAIL

(Internet domain sockets) The socket referred to by sockfd had not previously been bound to an
address and, upon attempting to bind it to an ephemeral port, it was determined that all port num‐
bers in the ephemeral port range are currently in use. See the discussion of
/proc/sys/net/ipv4/ip_local_port_range in ip(7).

The crash sometimes happens during a nightly cron job that causes some heavy network traffic; it mirrors the contents of an HTTP server onto Google Cloud Storage. But journaldriver often survives this job, and it sometimes crashes at other moments of the day as well.

The cron job isn't parallel, but it does make a lot of HTTP requests in short succession. (The Python requests package should use keep-alive but I'm not sure whether the Google Cloud Storage client library also does that. Would be silly if it didn't!) If each of those connections results in a socket lingering in TIME_WAIT state, I can imagine running out of ephemeral ports. If that hypothesis is correct, this ServerFault answer has some things I could try, in particular setting net.ipv4.tcp_tw_reuse=1.

But whatever the root cause is, I think journaldriver should not crash, but rather just log the error (probably at "notice" or "warning" level) and retry later.

@ttencate
Copy link
Author

ttencate commented Nov 8, 2018

Going to deploy a debug build now, so we'll hopefully get a proper backtrace for the actual panic as well.

@ttencate
Copy link
Author

ttencate commented Nov 9, 2018

Proper backtrace for the actual panic:

thread 'main' panicked at 'log receiver encountered an unexpected error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
stack backtrace:
   0:     0x55cebfecadf5 - backtrace::backtrace::libunwind::trace::hddcf23dceed63861
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace::hc0889779a13962b6
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/mod.rs:42
   1:     0x55cebfec557e - backtrace::capture::Backtrace::new_unresolved::h135cd2dabc77d958
                        at .../github.com-1ecc6299db9ec823/backtrace-0.3.9/src/capture.rs:88
   2:     0x55cebfec3d63 - failure::backtrace::internal::InternalBacktrace::new::hf969fe947675c857
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/backtrace/internal.rs:44
   3:     0x55cebfec2cec - failure::backtrace::Backtrace::new::h4fc9ca820767cd90
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/backtrace/mod.rs:111
   4:     0x55cebfb382fd - <failure::error::error_impl::ErrorImpl as core::convert::From<F>>::from::h27c39d1f605833b1
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/error/error_impl.rs:20
   5:     0x55cebfb420ae - <failure::error::Error as core::convert::From<F>>::from::hcdf7ee17e121cfda
                        at .../github.com-1ecc6299db9ec823/failure-0.1.2/src/error/mod.rs:36
   6:     0x55cebfb33d03 - journaldriver::receiver_loop::h6cd209a6724f67b9
                        at src/main.rs:492
   7:     0x55cebfb37520 - journaldriver::main::h072e367093bba6a9
                        at src/main.rs:649
   8:     0x55cebfb1a9ef - std::rt::lang_start::{{closure}}::hcd7f7fa1470e80a0
                        at /build/rust/src/rustc-1.30.0-src/src/libstd/rt.rs:74
   9:     0x55cebfeff4b2 - std::panicking::try::do_call::h492948eea006f3bc
  10:     0x55cebff0af29 - __rust_maybe_catch_panic
  11:     0x55cebfeffe3a - std::rt::lang_start_internal::h7206055d3d0638b8
  12:     0x55cebfb1a9c7 - std::rt::lang_start::h1aa454bce21ee725
                        at /build/rust/src/rustc-1.30.0-src/src/libstd/rt.rs:74
  13:     0x55cebfb37a89 - main
  14:     0x7f1aeb9212b0 - __libc_start_main
  15:     0x55cebfb16c5d - _start
  16:                0x0 - <unknown>', libcore/result.rs:1009:5

@tazjin tazjin added the bug Something isn't working label Feb 4, 2019
@tazjin
Copy link
Owner

tazjin commented Feb 4, 2019

Totally missed this issue somehow, sorry!

But whatever the root cause is, I think journaldriver should not crash, but rather just log the error (probably at "notice" or "warning" level) and retry later.

I think I agree! Going to take a look at this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants