Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reintroduce score submission retry mechanism #24609

Open
2 tasks
bdach opened this issue Aug 21, 2023 · 15 comments
Open
2 tasks

Reintroduce score submission retry mechanism #24609

bdach opened this issue Aug 21, 2023 · 15 comments
Labels
priority:1 Very important. Feels bad without fix. Affects the majority of users. type:online

Comments

@bdach
Copy link
Collaborator

bdach commented Aug 21, 2023

Type

Game behaviour

Bug description

In case of online outages, stable will retry score submission until success. This is not yet implemented in lazer, so a failed score submission will only fail once and then be lost to the depths of time. Users may understandably not be very happy with that, so we should make sure we have an equivalent (or better) mechanism of handling this.

Screenshots or videos

n/a

Version

current master

Logs

n/a

Tasks

  1. area:gameplay priority:1 type:behavioural
@peppy peppy added the priority:1 Very important. Feels bad without fix. Affects the majority of users. label Aug 21, 2023
@FreezyLemon
Copy link
Contributor

AFAIK, in stable, scores in the resubmission queue are lost if the game is closed or crashes.

How about some sort of protection mechanism against that? This would mean pending scores need to be stored on-disk, which comes with anti-cheat considerations. Not sure if the team feels comfortable with it, but I'll throw it out there.

@peppy
Copy link
Member

peppy commented Aug 25, 2023

out of scope.

@bdach
Copy link
Collaborator Author

bdach commented Oct 25, 2023

So I've put a little bit of thought into it and have a few thoughts...

Adding "retry" itself is not very straightforward, because as is, from a standpoint of a single user, the entire submission process ends up being quite serialised. In stable this wasn't generally an issue, since submission was a one-and-done affair - just a single request. In lazer we have score tokens and then actual submission and then also osu-server-spectator gets involved because it is the one responsible for uploading replays (stable just did that in the single request).

For a refresher, here's how the current flow looks (lifted from https://github.com/ppy/osu-infrastructure/blob/master/score-submission.md, with irrelevant bits dropped):

sequenceDiagram title Score submission flow
    participant C as osu!lazer
    participant W as osu-web
    participant S as osu-spectator-server
    participant DB as MySQL
    participant S3

    C->>+W: Request score submission token (POST /beatmaps/{beatmap_id}/solo/scores)

    W->>DB: Store token to solo_score_tokens
    W-->>-C: Provide {token_id} (APIScoreToken)

    C->>+S: Signal begin play (BeginPlaySession(token_id))
    Note over C: Playing beatmap

    loop Gameplay
        C->>S: Send frame bundle (SendFrameData)
    end

    Note over C: Finished playing
    C->>S: Signal finished (EndPlaySession)

    C->>+W: Submit score (PUT /beatmaps/{beatmap_id}/solo/scores/{token_id})
    W->>DB: Store score to solo_scores
    W-->>-C: Provide {score_id}
    
    par Replay upload
        DB-->>S: Found score_id for token (solo_score_tokens.score_id IS NOT NULL)
        S->>S3: Upload replay (ScoreUploader.Flush)
        S->>-DB: Mark has replay (UPDATE solo_scores SET has_replay = 1)
    end
Loading

To be more precise, the current situation is:

  • The client currently cannot handle parallel submission, for two reasons:
    • Score token requests and score submission requests are both queued onto the same API queue. So an unfinished submission blocks subsequent score token requests; you can't even begin a play until the preceding submission has finished.

      This is mostly a client-only concern and be addressed with a separate request queue; I'll leave it out of the following deliberations for now.

    • SpectatorClient cannot handle multiple active plays. It assumes one active play at any given time.

  • osu-web doesn't really care.
    • You can ask for multiple score tokens without a single submission and it will work.
    • As long as the score token exists (I believe they have a limited lifetime), you can submit the scores in any order.
  • osu-server-spectator kinda cares but kinda not.
    • BeginPlaySession() can be called multiple times without an EndPlaySession(). Doing this will just drop the old data.
    • EndPlaySession() however relies on the ambient user state to get the active score token, which means that it implicitly assumes only one score can be in flight at a time. Disregarding this would cause replays to be mis-uploaded and/or dropped.

I have a few ideas what could be done with it but they have sweeping implications...

Proposal A: Allow osu-server-spectator to handle multiple concurrent scores

sequenceDiagram title Score submission flow
    participant C as osu!lazer
    participant W as osu-web
    participant S as osu-spectator-server
    participant DB as MySQL
    participant S3

    C->>+W: Request score submission token (POST /beatmaps/{beatmap_id}/solo/scores)

    W->>DB: Store token to solo_score_tokens
    W-->>-C: Provide {token_id} (APIScoreToken)

    C->>+S: Signal begin play (BeginPlaySession(token_id))
    Note over C: Playing beatmap

    loop Gameplay
        C->>S: Send frame bundle (SendFrameData)
    end

    Note over C: Finished playing

    C->>+W: Submit score (PUT /beatmaps/{beatmap_id}/solo/scores/{token_id})
    W->>DB: Store score to solo_scores
    W-->>-C: Provide {score_id}
    C->>S: Signal finished (EndPlaySession(token_id))
    
    par Replay upload
        DB-->>S: Found score_id for token (solo_score_tokens.score_id IS NOT NULL)
        S->>S3: Upload replay (ScoreUploader.Flush)
        S->>-DB: Mark has replay (UPDATE solo_scores SET has_replay = 1)
    end
Loading

This is the minimalistic approach, keeping pieces as they are, and doing the bare minimum to avoid osu-server-spectator from confusing itself. All pieces are involved roughly as they were, but the assumption of "one score at a time" is removed from spectator server by way of (a) keeping multiple scores active in the user state and (b) providing the score token that initiated the submission when calling EndPlaySession(). This ensures that osu-server-spectator does not confuse multiple in-flight scores.

The assumption would be that spectators are presented with the latest started score; tracking unfinished scores would only be internal to the spectator server for replay upload purposes.

Tradeoffs:
➕ Cumulatively, probably the least work
➕ Addresses ppy/osu-server-spectator#191 - upon a transient submission failure, the EndPlaying() call would be delayed until submission completes, at which point the score token (or just plain score ID) provided therein allows everything to be handled correctly
➖ Complicated
➖ Would require dropping the "no concurrent active scores" safeties when communicating with server which may lead to Weird Things happening (especially when users try funny stuff)

Proposal B: Have osu-server-spectator be the middleman for the entirety of score submission

sequenceDiagram title Score submission flow
    participant C as osu!lazer
    participant W as osu-web
    participant S as osu-spectator-server
    participant DB as MySQL
    participant S3

    C->>+S: Signal begin play (BeginPlaySession({beatmapInfo, rulesetId, versionHash}))

    S->>+W: Request score submission token (POST /beatmaps/{beatmap_id}/solo/scores)

    W->>DB: Store token to solo_score_tokens
    W-->>-S: Provide {token_id} (APIScoreToken)
    S-->>C: Provide {token_id} (APIScoreToken)

    Note over C: Playing beatmap

    loop Gameplay
        C->>S: Send frame bundle (SendFrameData)
    end

    Note over C: Finished playing
    C->>S: Signal finished (EndPlaySession(token_id))

    S->>+W: Submit score (PUT /beatmaps/{beatmap_id}/solo/scores/{token_id})
    W->>DB: Store score to solo_scores
    S->>S3: Upload replay (ScoreUploader.Flush)
    S->>DB: Mark has replay (UPDATE solo_scores SET has_replay = 1)
    S-->>-C: Provide {score_id}
Loading

This was previously floated at some point. In this approach, the client does not communicate directly with web at all anymore. The entire submission flow is middle-manned by the spectator server. This still requires the changes from proposal A, so that spectator server can handle multiple in-flight scores.

➕ Simpler than current
➕ Addresses ppy/osu-server-spectator#191 by removing the problem entirely
➖ More work (will need backwards migration path)
➖ Makes osu-spectator-server much more critical than previously
➖ Has osu-web auth concerns (will need the submission endpoints to accept server-server calls) - or, could skip osu-web communication too, but that requires moving all logic out of web to spectator server

Proposal C: Exclude osu-server-spectator from replay handling entirely

sequenceDiagram title Score submission flow
    participant C as osu!lazer
    participant W as osu-web
    participant S as osu-spectator-server
    participant DB as MySQL
    participant S3

    C->>+W: Request score submission token (POST /beatmaps/{beatmap_id}/solo/scores)

    W->>DB: Store token to solo_score_tokens
    W-->>-C: Provide {token_id} (APIScoreToken)

    C->>+S: Signal begin play (BeginPlaySession(token_id))
    Note over C: Playing beatmap

    loop Gameplay
        C->>S: Send frame bundle (SendFrameData)
    end

    Note over C: Finished playing
    C->>S: Signal finished (EndPlaySession)

    C->>+W: Submit score with replay (PUT /beatmaps/{beatmap_id}/solo/scores/{token_id})
    W->>DB: Store score to solo_scores
    W->>S3: Upload replay (ScoreUploader.Flush)
    W->>DB: Mark has replay (UPDATE solo_scores SET has_replay = 1)
    W-->>-C: Provide {score_id}
Loading

Here, the replay is part of the score submission request. All that osu-server-spectator does from now on is handle spectating (and, well, notifications for user statistics updates I guess). It does not participate in submission.

➕ Simpler than current
osu-server-spectator becomes less critical
➕ Addresses ppy/osu-server-spectator#191 by removing the problem entirely
➖ More work
➖ Client becomes more authoritative (although it kinda already is, since it's the one sending replay frames to spectator unquestioned anyway, so it's a weak argument)
osu-web will need to ingest lots more data for submission (the replays aren't that large, but they're definitely larger than just SoloScore serialised to json).


Not sure anyone is going to read this wall of text, or that github is the right way to continue this conversation. Maybe going over this in a call would be best.

I'd be interested in thoughts from @ppy/team-client at the least. Maybe even @ppy/team-web. And maybe even @ThePooN.

@ThePooN
Copy link
Member

ThePooN commented Oct 25, 2023

A question I've always had in my mind to begin with: why do we use this replay recording flow with osu-spectator-server instead of just uploading on submission like osu-web? Always seemed over-complicated to me.

I'm hugely in favor of proposal C, at least for the time being as we aim to release ranked play with just a few weeks left, and I don't see the architectural issues with osu-spectator-server being solved anytime soon. And as I previously voiced, I don't want to spend my Christmas holidays working on infra again.

I am not sure what you mean with the last negative point in that proposal though, I don't think that's an issue.

@bdach
Copy link
Collaborator Author

bdach commented Oct 25, 2023

I am not sure what you mean with the last negative point in that proposal though, I don't think that's an issue.

Mostly spitballing since I don't have real experience with the infra. If it's not a problem then that's good.

@peppy
Copy link
Member

peppy commented Oct 26, 2023

A question I've always had in my mind to begin with: why do we use this replay recording flow with osu-spectator-server instead of just uploading on submission like osu-web? Always seemed over-complicated to me.

The idea is that it makes it much harder to fabricate score uploads, and gives us a lot more data points to work with.

Plus it uses "spectator" data as the final replay data, which just feels right.

This comes at a slight complexity cost and potential reliability cost for users with more flaky internet access.

Proposal C would be a hugely incorrect direction IMO.

I don't want to spend my Christmas holidays working on infra again.

If you need to do this then something is wrong. If I can't handle infrastructure myself anymore then everything you've changed is a step back and we should reassess.

@ThePooN
Copy link
Member

ThePooN commented Oct 26, 2023

A question I've always had in my mind to begin with: why do we use this replay recording flow with osu-spectator-server instead of just uploading on submission like osu-web? Always seemed over-complicated to me.

The idea is that it makes it much harder to fabricate score uploads, and gives us a lot more data points to work with.

Plus it uses "spectator" data as the final replay data, which just feels right.

This comes at a slight complexity cost and potential reliability cost for users with more flaky internet access.

Proposal C would be a hugely incorrect direction IMO.

I can't recall score uploads being exploited in recent times on stable, wouldn't this be something that's already mitigated enough with AC?
I agree the idea "feels right" and has merits but the implementation "feels wrong" as it stands atm. And I don't think it's an efficient path to try and fix given the target release date and history.

Another concern is resources usage. Such a SPOF should have as little responsability and load as possible. There will be inevitable scaling issues as I touched on this on Discord: https://discord.com/channels/90072389919997952/983550677794050108/1144383316523487334

I don't want to spend my Christmas holidays working on infra again.

If you need to do this then something is wrong. If I can't handle infrastructure myself anymore then everything you've changed is a step back and we should reassess.

I mean, it's the same thing we touched on with smoogi while we were together at COE. I don't mind doing incident response even during holidays, but if we have to fix stuff, work against architectural decisions at the same time, then we've failed somewhere. And nobody wants to watch the game struggle so I'd still jump in and help, but I'd rather have us plan ahead and not end up in that situation in the first place.

@bdach
Copy link
Collaborator Author

bdach commented Oct 26, 2023

The idea is that it makes it much harder to fabricate score uploads

Does it though? I've considered that, but it felt like a weak argument, since the frames that spectator server receives come from client anyways, and are accepted unquestioned. So I'm not sure that it helps prevent fabrication very much if at all (a rogue client can just report false frames the entire time rather than just at the end).

@peppy
Copy link
Member

peppy commented Oct 26, 2023

and are accepted unquestioned

Just like the fact we don't have any anti-cheat, these are systems which will be built on.

And I don't think it's an efficient path to try and fix given the target release date and history.

This is weird. It's already implemented, the proposals here are basically rewriting everything to add a small amount of reliability guarantee.

@peppy
Copy link
Member

peppy commented Oct 26, 2023

I'd say this:

Disregarding all discussion above, a first point to address is the ability for a user to reconnect to osu-server-spectator. Regardless of whether we move responsibilities around, we need to ensure that a dropped websocket connection is not the end of the world for multiplayer, spectator, and (currently) replay storage.

Once we have this in place, we can decide whether responsibilities need to be shuffled.

@peppy
Copy link
Member

peppy commented Dec 20, 2023

Moving out of the project for now but leaving as p1. We've had very few reports of submission failures (touch wood?).

@bdach bdach removed their assignment Feb 1, 2024
@bdach
Copy link
Collaborator Author

bdach commented Feb 1, 2024

Going to tentatively add this to project as stuff like https://twitter.com/reeeooot/status/1752843760935862398 is being talked about.

@iwa
Copy link

iwa commented Feb 18, 2024

if i could add my 2cts, it also happened to me even though i have a pretty stable fiber internet connection
happened only once tho, on a pretty good score so loss of pps, but still happened (proof)
it honestly feels really frustrating getting a score dropped just cause of some networking instability

i'm down sharing more details on my setup, logs, replay files or anything that could help y'all if needed

@peppy
Copy link
Member

peppy commented Feb 18, 2024

if i could add my 2cts, it also happened to me even though i have a pretty stable fiber internet connection happened only once tho, on a pretty good score so loss of pps, but still happened (proof) it honestly feels really frustrating getting a score dropped just cause of some networking instability

i'm down sharing more details on my setup, logs, replay files or anything that could help y'all if needed

Could you please attach logs? You can export using "Export logs" in settings.

@iwa
Copy link

iwa commented Feb 18, 2024

Could you please attach logs? You can export using "Export logs" in settings.

sure, there you go: compressed-logs.zip
but this happened on the 3rd Feb, not sure if the logs goes this far?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:1 Very important. Feels bad without fix. Affects the majority of users. type:online
Projects
None yet
Development

No branches or pull requests

5 participants