Crypto: Posthog analytics for problems when sending message keys over to-device messages #2409
Labels: A-E2EE, A-Telemetry (Telemetry / analytics to understand usage), T-Feature (Request to add a new feature which does not exist right now)
There are various failure modes that can lead to problems sending to-device messages containing message keys, which will in turn lead to unable-to-decrypt (UTD) errors on the receiving side. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.
Likely root causes are the target user's homeserver being unreachable (related: #2154), or our own homeserver being unresponsive. More specific examples include:
- `/keys/query` request failed (or the user is on a server we backed off from)
- `/keys/claim` request failed for a given device (or we have backed off for this device)

See also #234 which covers the receiving side of this (and is IMHO much lower-hanging fruit).
Question
A single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What exactly are we trying to achieve with these metrics?
Implementation design
Slightly tricky because the things we need to report on are scattered around the codebase, though they are mostly within `matrix-sdk-crypto`. I think the first step here is to define an interface in `matrix-sdk-crypto` which emits an enum of potential error codes.

We can then add a method `OlmMachine::share_room_keys_failure_stream`, which returns a `Stream`, and each time something on the list above goes wrong, we write a new entry to the stream. The stream could then be wrapped in both (Rust) `matrix-sdk` and `matrix-js-sdk`, for turning into Posthog events.
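To make the proposal more concrete, here is a rough sketch of what the enum and the reporting channel could look like. Everything in it is an assumption for illustration: `ShareRoomKeyFailure`, `FailureEvent`, `FailureReporter` and `count_by_kind` are hypothetical names, not existing `matrix-sdk-crypto` APIs, and a `std::sync::mpsc` channel stands in for the async `Stream` so the example has no dependencies. The `count_by_kind` helper shows one possible answer to the question above (aggregate per failure kind rather than emit one event per device per message); it is not a decision.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical error codes; the variants mirror the failure modes listed in this issue.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum ShareRoomKeyFailure {
    /// The `/keys/query` request for the user failed.
    KeysQueryFailed,
    /// We skipped `/keys/query` because we backed off from the user's server.
    KeysQueryBackedOff,
    /// The `/keys/claim` request for the device failed.
    KeysClaimFailed,
    /// We skipped `/keys/claim` because we backed off for this device.
    KeysClaimBackedOff,
}

/// One entry written to the (hypothetical) failure stream.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FailureEvent {
    pub user_id: String,
    /// `None` when the failure applies to the whole user, e.g. a failed `/keys/query`.
    pub device_id: Option<String>,
    pub failure: ShareRoomKeyFailure,
}

/// Sending half, which the crypto code would hold and call into.
pub struct FailureReporter {
    sender: Sender<FailureEvent>,
}

impl FailureReporter {
    /// Creates the reporter plus the receiving half that a wrapper
    /// (assumed to live in `matrix-sdk` or `matrix-js-sdk`) would consume.
    pub fn new() -> (Self, Receiver<FailureEvent>) {
        let (sender, receiver) = channel();
        (Self { sender }, receiver)
    }

    /// Records a single failure. Send errors are ignored: if nobody is
    /// listening, telemetry is simply dropped.
    pub fn report(&self, user_id: &str, device_id: Option<&str>, failure: ShareRoomKeyFailure) {
        let _ = self.sender.send(FailureEvent {
            user_id: user_id.to_owned(),
            device_id: device_id.map(str::to_owned),
            failure,
        });
    }
}

/// Drains everything currently queued and counts events per failure kind,
/// so that a caller could send one aggregated analytics event per kind.
pub fn count_by_kind(receiver: &Receiver<FailureEvent>) -> HashMap<ShareRoomKeyFailure, usize> {
    let mut counts = HashMap::new();
    for event in receiver.try_iter() {
        *counts.entry(event.failure).or_insert(0) += 1;
    }
    counts
}
```

In this sketch the crypto code would call `report` at each failure site, and the wrapping layer would periodically call `count_by_kind` (or iterate the raw events) and translate the result into Posthog events. Whether to aggregate per message, per user or per time window is still the open question above.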