Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop keep alive when sync is aborted #3686

Conversation

finsterwalder
Copy link
Contributor

@finsterwalder finsterwalder commented Aug 29, 2023

Type: [defect] T-Defect
This PR solves issue: #3678

When you start a client, which has connection problems, a keepAlive is started.
But when the connection is actively stopped e.g. via calling stopSync(), the connection is aborted.
When the connection is aborted, it does not make sense to continue to try to connect, because it will never succeed.

This is solved by this PR.

I could not find any tests for the sync, so I didn't find a place to extend tests to check this special case.

Checklist

  • Tests written for new code (and old code if feasible)
  • Linter and other CI checks pass
  • Sign-off given on the changes (see CONTRIBUTING.md)

Here's what your changelog entry will look like:

✨ Features

@finsterwalder finsterwalder requested a review from a team as a code owner August 29, 2023 20:37
@github-actions github-actions bot added the Z-Community-PR Issue is solved by a community member's PR label Aug 29, 2023
@finsterwalder finsterwalder changed the title Stop keep alive when sync is aborted T-Defect Stop keep alive when sync is aborted Aug 29, 2023
@finsterwalder finsterwalder changed the title T-Defect Stop keep alive when sync is aborted Stop keep alive when sync is aborted Aug 29, 2023
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution!

However... I don't really understand this, and it seems dangerous.

First of all, please can you explain the code path more clearly. Why is an AbortError thrown in this situation? I'm not very familiar with this area of the code, I'm afraid. In any case, the new code needs more comments to explain what is going on.

Second: how can you be sure that this is the only reason that an error like this will be thrown? I note that the code that handles regular /sync requests doesn't do anything like this. Instead, it checks this.running (

if (!this.running) {
). It seems like we should do the same thing here.

Third: please do add some tests for this case, otherwise there is a high chance of a regression. I think the best place is spec/integ/matrix-client-syncing.spec.ts.

@richvdh
Copy link
Member

richvdh commented Aug 30, 2023

@finsterwalder Also: I see that you have ticked the box for "Sign-off given on the changes" but I don't see your sign-off here. Please do read CONTRIBUTING.md for details on the DCO process.

@germain-gg germain-gg removed their request for review August 30, 2023 16:45
@finsterwalder
Copy link
Contributor Author

Thanks for checking my PR so quickly.

The code path is as follows.
When the sync is started, but fails initially, recoverFromSyncStartupError() is called, which indirectly calls pokeKeepAlive(), which keeps polling the versions endpoint.
This polling only stops, when we get a http response, that is either successful or has at least http status 400 or 404.
In all other cases it keeps polling.
When sync.stop() is called in this situation, abortController.abort() is called, which leads to all further http calls resulting in an AbortError (the http request itself is not even executed anymore). The above keep alive does not stop, though and will simply keep polling indefinitely.
So my rational was, that getting an AbortError is a sure sign, that polling can never succeed and should end.

But you are right. if (!this.running) is probably the more appropriate check.
It's set to true, right before the abortController is set and it's set to false, right before abortController.abort() is called. So it should have the same effect as checking for an AbortError.
I will update my PR accordingly.

I added a sign-off now. I did that on a different branch and then forgot it. ;-)

I looked into the test file, but I didn't see any tests, that check the keep alive feature.
That's quite a task for me to get acquainted enough to write such a test for the first time...

@richvdh richvdh self-requested a review August 31, 2023 06:04
Comment on lines +1607 to +1612
if (!this.running) {
clearTimeout(this.keepAliveTimer);
if (this.connectionReturnedDefer) {
this.connectionReturnedDefer.reject();
this.connectionReturnedDefer = undefined;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, the new code needs more comments to explain what is going on.

please do give this code some comments to explain what it is doing. Imagine you are unfamiliar with this code and trying to understand it. When does it happen, why is this the right thing to do? Are there other places in this file that have similar logic that it might be helpful to link to?

@richvdh
Copy link
Member

richvdh commented Sep 4, 2023

I added a sign-off now. I did that on a different branch and then forgot it. ;-)

I see a sign-off on one of the commits in this PR, but not the ones with all the new code! Best thing to do is to update the description of the PR to show that your signoff covers all the changes in the PR.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into the test file, but I didn't see any tests, that check the keep alive feature.
That's quite a task for me to get acquainted enough to write such a test for the first time...

I talked to some other members of the team. We're agreed that we should not make an exception from our policy that all changes should include tests, and we think it's important that this change be tested. We don't feel like we can land it without tests.

It doesn't have to be a perfect test - all we're looking for is something that fails before your change, and passes afterwards.

I'm sorry that the existing code is not well tested, and I'm also sorry that nobody is in a position to give better guidance on writing a new test.

@finsterwalder
Copy link
Contributor Author

Ok. Good points. I will add documentation.
I will also fix the sign-off. Sorry for that, have never done that before. :-)
I will also look into creating a test. But that might take a little while...

Thinking about my problem a little more, I have some ideas for followup improvements. Was wondering, what you think about it:

This is a keepAlive, which keeps retrying the call to the matrix server. When I stop the sync, the retry should obviously stop. That's what my fix is about.
But maybe there are other error cases, where we know that retrying is useless, because it can never succeed?
e.g. A wrong auth token will never fix itself by just retrying. That's exactly the case that is at the root of my problem. Just that I need to detect the auth error myself (listening for an HttpApiEvent.SessionLoggedOut even) and the stopping the sync myself. With my fix, that stopSync call will stop the keepAlive now. But wouldn't it make sense to stop the keepAlive already, once a broken auth token is detected? So a response with http status 401 or 403 are pointless to retry IMHO. And maybe there are other cases, where retrying is pointless? How about 407 Proxy Authentication Required or 410 Gone (for good)? 414 URI Too Long? 431 Request Header Fields Too Large?

@richvdh
Copy link
Member

richvdh commented Sep 5, 2023

But maybe there are other error cases, where we know that retrying is useless, because it can never succeed?
e.g. A wrong auth token will never fix itself by just retrying. That's exactly the case that is at the root of my problem. Just that I need to detect the auth error myself (listening for an HttpApiEvent.SessionLoggedOut even) and the stopping the sync myself. With my fix, that stopSync call will stop the keepAlive now. But wouldn't it make sense to stop the keepAlive already, once a broken auth token is detected?

You're probably right that there are lots of other calls that aren't worth retrying. However: One problem is that, as things currently stand, the spec (https://spec.matrix.org/v1.8/client-server-api/#get_matrixclientv3sync) has very little to say about which error codes might be returned, and what you can infer from them.

A second problem: what should we actually do on one of these errors? If we just stop the sync loop, it just leaves the user with a dead client. Should we raise a SessionLoggedOut event? That's a bit of a nuclear bomb because it will make the client forget all of its state (including E2EE keys, etc). Who is to say that it is not a temporary outage?

So we might be able to improve this, but I strongly recommend focussing on stopSync during keepalive for now.

@finsterwalder
Copy link
Contributor Author

I understand that including other cases would need a little more thinking. I just wanted to raise awareness and start this discussion.
In this change I will only do the handling of !isRunning to stop after calling stopSync.

The 401 case is currently notified via the HttpApiEvent.SessionLoggedOut event, so there exists a possible handling strategy. And it's also clear, that this will not fix itself with retries. Would it make sense to include this case soon?

@finsterwalder
Copy link
Contributor Author

It was easier to create a new, clean Pull Request: #3720

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-Enhancement Z-Community-PR Issue is solved by a community member's PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants