Crypto: Posthog analytics for problems when sending message keys over to-device messages #2409

Open
richvdh opened this issue Apr 29, 2024 · 1 comment
Labels
A-E2EE A-Telemetry Telemetry / analytics to understand usage T-Feature Request to add a new feature which does not exist right now

Comments

@richvdh
Member

richvdh commented Apr 29, 2024

There are various failure modes that can lead to problems sending to-device messages containing message keys, which will in turn lead to UTD errors. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.

Likely root causes are the target user's homeserver being unreachable (related: #2154), or our own homeserver being unresponsive. More specific examples include:

  • We lack a device list and the /keys/query request failed (or the user is on a server we backed off from)
  • We failed to /keys/claim for a given device (or have backed off for this device)
  • We couldn't send the to-device message itself.

See also #234 which covers the receiving side of this (and is IMHO much lower-hanging fruit).

Question

A single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What exactly are we trying to achieve with these metrics?
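To make the "something more intelligent" option concrete, here is a minimal sketch of one possibility: collapse all per-device failures for a single sent message into one summary with a count per failure kind. This is only an illustration of the trade-off, not a decision. `FailureKind`, `MessageFailureSummary` and `summarise` are hypothetical names and do not exist in any SDK.

```rust
use std::collections::BTreeMap;

/// Hypothetical failure kinds, mirroring the three examples listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FailureKind {
    KeysQueryFailed,
    KeysClaimFailed,
    ToDeviceSendFailed,
}

/// One summary per sent message, instead of one event per device.
#[derive(Debug, Default)]
pub struct MessageFailureSummary {
    /// Number of affected devices (or users) for each failure kind.
    pub counts: BTreeMap<FailureKind, usize>,
}

impl MessageFailureSummary {
    /// Record a single failure of the given kind.
    pub fn record(&mut self, kind: FailureKind) {
        *self.counts.entry(kind).or_insert(0) += 1;
    }

    /// Total number of failures recorded for this message.
    pub fn total(&self) -> usize {
        self.counts.values().sum()
    }
}

/// Fold all failures seen while sharing keys for one message into a summary.
pub fn summarise(failures: &[FailureKind]) -> MessageFailureSummary {
    let mut summary = MessageFailureSummary::default();
    for kind in failures {
        summary.record(*kind);
    }
    summary
}
```

With this shape, a message sent to a room with thousands of devices produces at most one analytics event. The cost is that we lose per-device detail, which may or may not matter depending on what the metrics are for.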


Implementation design

This is slightly tricky because the things we need to report on are scattered around the codebase, though they are mostly within matrix-sdk-crypto. I think the first step here is to define an interface in matrix-sdk-crypto which emits an enum of potential error codes.

We can then add a method OlmMachine::share_room_keys_failure_stream, which returns a Stream, and each time something on the list above goes wrong, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.
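A rough sketch of what that interface could look like follows. To stay self-contained it uses a `std::sync::mpsc` channel as a stand-in for a real async `Stream`. The enum variants, their fields, and the `FailureReporter` type are all assumptions made for illustration; only the name `share_room_keys_failure_stream` comes from the proposal above.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical error codes, one per failure mode listed in this issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ShareRoomKeysFailure {
    /// No device list, and `/keys/query` failed (or the server is backed off).
    KeysQueryFailed { user_id: String },
    /// `/keys/claim` failed for a device (or the device is backed off).
    KeysClaimFailed { user_id: String, device_id: String },
    /// The to-device message itself could not be sent.
    ToDeviceSendFailed { user_id: String, device_id: String },
}

/// Stand-in for the part of `OlmMachine` that would own the reporting side.
#[derive(Default)]
pub struct FailureReporter {
    subscribers: Vec<Sender<ShareRoomKeysFailure>>,
}

impl FailureReporter {
    /// Analogue of the proposed `share_room_keys_failure_stream`:
    /// each call returns a new receiver that sees all later failures.
    pub fn share_room_keys_failure_stream(&mut self) -> Receiver<ShareRoomKeysFailure> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Called from wherever one of the failure modes is detected.
    pub fn report(&mut self, failure: ShareRoomKeysFailure) {
        // Drop subscribers whose receiving end has gone away.
        self.subscribers
            .retain(|tx| tx.send(failure.clone()).is_ok());
    }
}

/// Usage example: subscribe, report two failures, then drain the stream.
pub fn example() -> Vec<ShareRoomKeysFailure> {
    let mut reporter = FailureReporter::default();
    let stream = reporter.share_room_keys_failure_stream();

    reporter.report(ShareRoomKeysFailure::KeysQueryFailed {
        user_id: "@alice:example.org".to_owned(),
    });
    reporter.report(ShareRoomKeysFailure::ToDeviceSendFailed {
        user_id: "@bob:example.org".to_owned(),
        device_id: "DEVICEID".to_owned(),
    });

    // A wrapping layer would turn each entry into a Posthog event here.
    stream.try_iter().collect()
}
```

The wrapping layers in matrix-sdk and matrix-js-sdk would then only need to consume entries from the stream and map each variant to an analytics event, without knowing where in the crypto code the failure was detected.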

@richvdh richvdh added the A-E2EE label Apr 29, 2024
@richvdh richvdh added T-Feature Request to add a new feature which does not exist right now A-Telemetry Telemetry / analytics to understand usage labels Apr 29, 2024
@andybalaam andybalaam changed the title Crypto: Posthog analytics for problems when sending message keys keys over to-device messages Crypto: Posthog analytics for problems when sending message keys over to-device messages May 13, 2024
@BillCarsonFr
Member

Having analytics for failures to decrypt to-device messages would be more useful now.


3 participants