-
-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Destroying TorService a second time causes crash #57
Comments
This may or may not be related, but it seems that you don't keep a reference to the |
ah, that sounds like a good catch! I could never reliably reproduce it to troubleshoot it. |
seems like keeping a reference, then closing it in |
Here is also some related info and debugging: |
Background info about fdsan:
|
I don't think |
Seems like this might need to be handled in jtorctl, since it's |
This should get run most of the time, but not always, since Android could send `kill -9`. But this might be better handled in jtorctl's TorControlConnection, since it wraps those streams. guardianproject#57
If that commit works, there should be a build available in the artfacts in this job, once it completes: |
This should get run most of the time, but not always, since Android could send `kill -9`. But this might be better handled in jtorctl's TorControlConnection, since it wraps those streams. guardianproject#57
This should get run most of the time, but not always, since Android could send `kill -9`. But this might be better handled in jtorctl's TorControlConnection, since it wraps those streams. guardianproject#57
E/fdsan: attempted to close file descriptor, expected to be unowned, actually owned
If you close the control connection streams there, won't this interfere with the ability to use the control connection? I pushed an extra test here that allows me to reproduce the problem: grote@04d249d#diff-169d30aebc4b72de0419cb79771f8ced07f96a9d6353f17fd2bd86f5d3fe1f8cR1 Again, I urge you to read the fdsan docs: https://android.googlesource.com/platform/bionic/+/master/docs/fdsan.md |
For reference: The native file descriptor code is here: https://github.com/guardianproject/tor/blob/3b6987c9df1f15621866898babbb96e7a010e1e8/src/feature/api/org_torproject_jni_TorService.c#L269 |
I won't have time to dig into this for a bit. If you want to try the naive closing, it built here: |
That test case looks very promising! |
Full backtrace follows below. Looks like a double close:
|
Wonder why this isn't happening in Orbot.. |
@grote can you tell what is calling close each time? @asn-d6 @ahf this seems to crash the same way as https://gitlab.torproject.org/tpo/core/tor/-/issues/32729 and @grote has put together a repro test case. The stack trace looks different though, so I'm not sure it is 100% related. |
I wonder as well. Maybe it isn't really stopping the service? I noticed that unbinding isn't enough to stop it when running a foreground service.
No, I just got
as posted in #57 (comment) . So maybe Tor is closing it and then something in Android is trying as well to close it?
Has anyone been able to reproduce the issue on an API 29+ device? Either with an app or that instrumentation test? Knowing if others can reproduce it, could be the first step to solving it. |
I can reproduce the crash with @grote's test on the Redmi Note 7 (API 29).
|
I haven't finished debugging this but I'll add some notes about what I've found so far. So far I haven't been able to reproduce the crash when running only @grote's test case, but I can usually get a crash when running that test case along with the existing tests in TorServiceTest.java. (I had to fix a couple of tests - one was using a v2 onion URL, the other was using a plaintext HTTP URL which the API 29 device refused to access.) Only a few of the crashes were related to fdsan errors. Most were caused by SIGABRT with one of the following two backtraces, both of which I saw multiple times:
I started to wonder whether it was possible for one test's call to runMain() to overlap with another test's call to the same method, which might result in two threads accessing the same static state inside the Tor library. The fdsan errors would be a specific case of this involving a file descriptor, while the other two backtraces would involve some other static state. I added some debug logging and found that when a crash occurred, the previous test case's call to runMain() often returned after the crash. However, the problem wasn't exactly caused by overlapping calls to runMain(), because the crashing test case hadn't called runMain() at the time of the crash (or, possibly, it had done so but then it had crashed before the log message could be written to the log). I added a lock to prevent concurrent calls to runMain(), but this didn't help. I made the lock static so that different TorService instances would share the same lock, and this did seem to make a difference. Since making the lock static I've only managed to capture one crash, which was also the only run where the log indicated that a TorService instance was waiting for the lock - indicating that different test cases were indeed potentially calling into the Tor library concurrently from different threads. Since the crash still happens, I think the problem is not exactly concurrent calls to runMain(), but rather concurrent calls to a set of native methods that includes runMain(). I'm going to increase the amount of code guarded by the lock to see if the crashes can be eliminated, and if so, which concurrent native calls are involved. |
I think the crashes are probably caused by calling mainConfigurationFree() from onDestroy() while runMain() is still running in another thread (or may not even have been called yet... the thread that calls it may be doing the other native calls to set up the command line and control socket). Moving the native setup calls and runMain() into a try block, with mainConfigurationFree() in the finally block, and also using a static lock to prevent concurrent entry into that block, seems to prevent the various crashes. However, fdsan now reliably complains about mainConfigurationFree() closing Tor's owning_controller_socket. This socket is the "other end" of the socket that's returned to TorService, and Tor closes its end when the main configuration is released. This produces the following complaint from fdsan:
(In this case the FD for TorService's end of the control connection was 62, while the FD for Tor's end was 63.) This happens before the static lock is released, so I don't think the issue is that another TorService instance is messing with the contents of the main configuration. And yet I can only reproduce this when running the whole test suite... so some kind of interaction between the different test cases does seem to be involved. Perhaps when TorService closes the control socket, the FD on Tor's end is automatically closed and released as well. But Tor still has a record of the FD number in its main configuration struct. Later, some other caller asks for an FD and the number previously used by Tor gets reallocated. (Perhaps this part only happens if other test cases are running, which is why I can't reproduce the problem with a single test case?). Then later, when mainConfigurationFree() is called, Tor tries to close the FD that was already closed earlier, and fdsan notices the problem? |
@akwizgran sounds like you're thinking that the crash isn't so much about the Control Socket but rather knowing when to call Tor's free/shutdown methods. That was my impression also, when we were going through the troubleshooting as discussed in https://gitlab.torproject.org/tpo/core/tor/-/issues/32729. I also remember having the issue where these kinds of crashes happened only when running the whole test suite, where there were repeated start/stop cycles. The hard part is knowing exactly what is not cleaned up between runs, the docs never seem to fully outline it. Instead, there are statements like (keyword "most"):
There is also the poorly documented runner option: // The following argument makes the Android Test Orchestrator run its
// "pm clear" command after each test invocation. This command ensures
// that the app's state is completely cleared between tests.
testInstrumentationRunnerArguments clearPackageData: 'true' Also, FYI, the core JNI code used here came from @sysrqb, and I wrapped that with the UNIX domain socket. @sysrqb do you have any thoughts on a proper shutdown process when running tor in a thread? |
Right. Specifically, I think the crash is caused by calling mainConfigurationFree() when the thread that makes the other native calls (most importantly runMain()) is still running. The documentation for tor_main_configuration_free() in tor_api.h says:
Moving mainConfigurationFree() to the same thread as the other native calls, and using a static lock to ensure that only one TorService instance per process is making native calls, prevents the crash. I'll clean these changes up and push them to a branch. There's also some good news about the fdsan error. I can now reproduce this with only a single test case, which only binds and unbinds the service once. The error doesn't happen in every run, but in most runs fdsan warns about the file descriptor that's closed inside mainConfigurationFree(). If the test case finishes immediately then fdsan logs the initial warning but doesn't have time to log a backtrace or write a tombstone. Adding a one-second sleep to the end of the test case allows enough time for that to happen. I think it was probably possible to reproduce the fdsan error before, but I was expecting to see a crash, so I ignored all the runs where the tests passed without crashing. That was a mistake. In the default configuration, fdsan doesn't terminate the process. It records a tombstone for the first fdsan violation, then disables further fdsan checks. So when I saw fdsan tombstones after a crash, they weren't the cause of the crash. Once I had prevented the crashes this became clear. As far as I can tell, the fdsan error is unrelated to the crashes, and we might want to open a separate ticket for it. It's possible that it's a bug in Tor, where owning_controller_socket ought to be cleared if the control socket is closed before tor_main_configuration_free() is called, in order to avoid a double close. I believe the reason we don't see the fdsan warning on every run is that fdsan only detects the double close if the file descriptor number is reallocated between the time when Tor's end of the control socket is closed and the time when tor_main_configuration_free() is called. |
Great work! 🎉
Could it be that my API 31 device is configured differently? When using your branch it is still crashing when running the tests. With Android Studio, there's occasional crashes, but when running them with The tombstone it wrote has quite some extensive debug information, including a list of file descriptors that are currently open. https://pastebin.com/kmmT83yD |
Ah, good thought! https://developer.android.com/about/versions/11/behavior-changes-all#fdsan "The default mode for fdsan is changing in Android 11. fdsan now aborts upon detecting an error; the previous behavior was to log a warning and continue." |
Nice!
Here's my theory: FD 69 and 70 were originally allocated to the control socket, with 69 being TorService's end and 70 being Tor's end. Tor closed its end of the socket during shutdown and FD 70 was reused by |
To add weight to this theory, you could log the value of |
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
closes guardianproject/tor-android#57 Should be removed once fixed in Tor upstream: https://gitlab.torproject.org/tpo/core/tor/-/issues/40525
This is here as an experiment in case it is useful in Java space. It is looking like there are better ways to handle the shutdown, so my guess is that this will not be useful. * guardianproject/tor-android#59 * guardianproject/tor-android#61 (comment) * guardianproject/tor-android#57
While how to stop
TorService
is not documented, one usually stops a bound service simply by unbinding from it.This works fine the first time. But if I need to start and bind to TorService a second time, I reproducibly get a hard crash in
libc
when unbinding again:The text was updated successfully, but these errors were encountered: