Occasional panic and crash from AddrNotAvailable #8
Labels: bug
Comments:

- Going to deploy a debug build now, so we'll hopefully get a proper backtrace for the actual panic as well.
- Proper backtrace for the actual panic:
- Totally missed this issue somehow, sorry!
- I think I agree! Going to take a look at this.

Original report:
I'm seeing this about once a week (but at unpredictable moments) running journaldriver on a GCE instance:
I used `RUST_BACKTRACE=1` in production to capture this trace, but I don't have a reliable way to reproduce the crash on demand.

According to the docs, `AddrNotAvailable` means "A nonexistent interface was requested or the requested address was not local."; not very descriptive. I suppose it corresponds to POSIX's `EADDRNOTAVAIL`, about which we can find a lot more information, e.g. from `connect(2)`:

The crash sometimes happens during a nightly cron job that causes some heavy network traffic; it mirrors the contents of an HTTP server onto Google Cloud Storage. But journaldriver often survives this job, and it sometimes crashes at other moments of the day as well.
The cron job isn't parallel, but it does make a lot of HTTP requests in short succession. (The Python `requests` package should use keep-alive, but I'm not sure whether the Google Cloud Storage client library also does that. Would be silly if it didn't!) If each of those connections results in a socket lingering in `TIME_WAIT` state, I can imagine running out of ephemeral ports. If that hypothesis is correct, this ServerFault answer has some things I could try, in particular setting `net.ipv4.tcp_tw_reuse=1`.
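One rough way to check that hypothesis is to count how many sockets are sitting in `TIME_WAIT` at the moment the crash tends to happen. The following is only a minimal sketch, assuming a Linux `/proc` filesystem; it parses the fourth field of `/proc/net/tcp`, which is the hex-encoded TCP state (`06` is TIME_WAIT):

```rust
use std::fs;

/// Count sockets currently in TIME_WAIT by scanning /proc/net/tcp and
/// /proc/net/tcp6. The fourth whitespace-separated field ("st") of each
/// entry is the hex-encoded TCP state; 06 means TIME_WAIT.
fn count_time_wait() -> usize {
    let mut count = 0;
    for path in ["/proc/net/tcp", "/proc/net/tcp6"] {
        if let Ok(contents) = fs::read_to_string(path) {
            count += contents
                .lines()
                .skip(1) // skip the header line
                .filter(|line| line.split_whitespace().nth(3) == Some("06"))
                .count();
        }
    }
    count
}

fn main() {
    println!("sockets in TIME_WAIT: {}", count_time_wait());
}
```

If that count gets anywhere near the size of the ephemeral port range (`net.ipv4.ip_local_port_range`), port exhaustion becomes plausible, and setting `net.ipv4.tcp_tw_reuse=1` via `sysctl` is one mitigation to try.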
But whatever the root cause is, I think journaldriver should not crash, but rather just log the error (probably at "notice" or "warning" level) and retry later.
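To make that last point concrete, here is a minimal sketch of what "log and retry" could look like. The function and the stub it calls are hypothetical and not journaldriver's real internals; the sketch only illustrates treating `io::ErrorKind::AddrNotAvailable` (std's mapping of `EADDRNOTAVAIL`) as a transient, retryable condition instead of unwrapping it:

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: instead of unwrapping the io::Error and panicking,
/// log a warning and retry with exponential backoff when the error looks
/// transient. `flush` stands in for whatever call is currently failing;
/// the real journaldriver function names will differ.
fn flush_with_retry<F>(mut flush: F, max_attempts: u32)
where
    F: FnMut() -> io::Result<()>,
{
    let mut delay = Duration::from_secs(1);
    for attempt in 1..=max_attempts {
        match flush() {
            Ok(()) => return,
            Err(ref err) if err.kind() == io::ErrorKind::AddrNotAvailable => {
                eprintln!(
                    "warning: flush failed ({}), attempt {}/{}, retrying in {:?}",
                    err, attempt, max_attempts, delay
                );
                thread::sleep(delay);
                delay *= 2; // simple exponential backoff
            }
            Err(err) => {
                // Other errors are still reported, but without panicking.
                eprintln!("error: flush failed: {}", err);
                return;
            }
        }
    }
    eprintln!("error: giving up after {} attempts", max_attempts);
}

fn main() {
    // Demo with a stub that always fails the way the issue describes.
    flush_with_retry(
        || Err(io::Error::from(io::ErrorKind::AddrNotAvailable)),
        3,
    );
}
```

The exponential backoff keeps a persistent network problem from turning into a tight retry loop, while still letting a one-off port-exhaustion episode resolve itself.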