Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Unable to restore session" due to lost pickle key #1816

Open
richvdh opened this issue May 8, 2024 · 25 comments
Open

"Unable to restore session" due to lost pickle key #1816

richvdh opened this issue May 8, 2024 · 25 comments
Labels
A-Storage Storage layer of the app, including IndexedDB, local storage, etc. O-Occasional Affects or can be seen by some users regularly or most users rarely S-Critical Prevents work, causes data loss and/or has no workaround T-Defect

Comments

@richvdh
Copy link
Member

richvdh commented May 8, 2024

There seems to be a problem wherein sometimes Element Desktop installations forget their pickle key, meaning users are unable to load their session.

The picklekey is used to encrypt things like the server access token, as well as the entire crypto DB (which holds the decryption keys for received messages).

On Element Desktop (in contrast with Element Web), the pickle key is stored in the system keyring, via keytar. We don't know why it is being lost.

The symptoms of the problem are a big error box like this on load:

image

NB: this is not the only root cause of that error; the only way to be sure it is the same problem is to inspect the console, or rageshake logs. It will (as of Element Desktop 1.11.78) show the line:

Error decrypting secret access_token: no pickle key found.

(Previously, the error message was Unable to load session Error: An object failed to be decrypted while unpickling)


Original description

We've seen a few user reports of "Unable to restore session" errors with things like the following in the logs:

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer `123`, expected a string at line 1 column 4
2024-04-30T06:35:34.559Z E Unable to load session invalid type: map, expected a sequence at line 1 column 0
2024-04-30T00:39:25.404Z E Unable to load session invalid type: null, expected a sequence at line 1 column 4

All of these suggest a problem deserialising the pickled account data in the core.account row in indexeddb. The first one suggests that the pickleKey has been deleted from the matrix-react-sdk indexeddb. The other two suggest some sort of malformed data in the matrix-js-sdk::matrix-sdk-crypto store. I have no real ideas how we're getting into this state.

@richvdh richvdh added T-Defect S-Major Severely degrades major functionality or product features, with no satisfactory workaround O-Uncommon Most users are unlikely to come across this or unexpected workflow labels May 8, 2024
@dosubot dosubot bot added A-Storage Storage layer of the app, including IndexedDB, local storage, etc. S-Critical Prevents work, causes data loss and/or has no workaround labels May 8, 2024
@richvdh
Copy link
Member Author

richvdh commented Jun 10, 2024

We suspect some of these failure modes might be an interaction with user being a guest and then trying to log in later.

@KrisVos130
Copy link

I have had this issue with Element Desktop about a dozen times over the past year or so. I submitted 3 rageshakes to #1632.

As I get this issue every couple of weeks, if there's any additional logging I can turn on to help fix this issue, or answer any questions, please let me know.

@kegsay
Copy link

kegsay commented Jun 11, 2024

I've just seen your bug report: thank you so much for being detailed on what you did. Thanks to this, I can clearly see that Unable to load session invalid type: integer 123`` is only happening on Electron (Desktop) clients.

This can be reproduced by deleting the pickleKey object store in the matrix-react-sdk database and then refreshing the client. What we're concerned about is whether this is A) uniquely caused by the new rust crypto stack, and if not B) is more frequent with the new rust crypto stack.

You say that you've had this "over the past year" - do you recall the earliest time and did you send a bug report when this happened? The earliest report I see from that user ID is Apr 22.

Also, do you ever try nightly builds of Element Desktop? How frequently do you upgrade versions? I'm wondering if you could be hitting electron/electron#27734 ?

@KrisVos130
Copy link

I did some digging in my own notes and chats, but couldn't find exact dates of every time it happened unfortunately.
I see I sent a rageshake on 2023-07-22 around 9am CEST to #692 (which I had completely forgotten about).
Many other times, I can imagine I either didn't send a rageshake due to privacy concerns, or I was unable to.

So I don't know the earliest time this exact issue happened to me, but the earliest time I can recall and sent a rageshake was 2023-07-22.

I also see my first report in the issue on element-desktop still uses Olm, so I assume that report and the times I experienced it before that were not using the rust crypto stack. Could be wrong of course. I also don't remember if I enabled rust crypto in labs or not. I doubt it, but can't be sure.

I haven't noticed an increase of this bug recently. Once every couple of weeks.

I never use the nightly builds of Element Desktop. I update every 1-3 weeks, and as the package in the AUR is not usually updated right away, it's not rare to be 1-2 versions behind.
In my latest report using the official tarballs, I hadn't updated Element in weeks, and hadn't updated packages in general in a few days. I also never downgrade Element, or mix and match different versions.

I have a few profiles in Element, but have exclusively used the default one for months.

@KrisVos130
Copy link

I did some more digging, and cross-referenced inactive sessions with some logging I have, and was able to get the dates of every time I had this issue in the past year or so. I am pretty confident these are all of them. It also appears that I overestimated how often this happened total. I have likely had this outcome a dozen times, but not in the past year, and with multiple different causes.

Looking at the dates, it does appear that the issue started happening more often more recently. I don't know what versions/dates the rust crypto became available/default, so can't comment on the relation to that.

  • 2023-02-23 21:13 (no rageshake sent - tray double Element bug - Application doesn't close to icon (no tray icon) element-web#23993 (comment) - possibly distro related, and has been fixed by now I think, not the same cause as the more recent ones)

  • 2023-07-21 08:41 (rageshake sent to element-desktop 692 - not sure if this has the same cause as the more recent ones)

  • 2024-02-09 18:40 (no rageshake)

  • 2024-02-27 16:33 (no rageshake)

  • 2024-04-08 21:12 (no rageshake)

  • 2024-04-22 05:36 (rageshake sent to element-desktop 1632)

  • 2024-05-01 09:16 (rageshake sent to element-desktop 1632)

  • 2024-06-10 20:50 (rageshake sent to element-desktop 1632)

@KrisVos130
Copy link

...and hopefully this will be my last message for a little while, but just had it happen again (rageshake sent). I was looking through the various tables in IndexedDB via devtools, noticed that the matrix-react-sdk pickleKey table was empty, was curious whether that meant it would be unable to restore the session, and sure enough. I was careful to not delete anything, so I didn't accidentally delete/edit/update/add anything in there. And it appears to always be empty, so even now with it being empty I can exit and re-open Element just fine...

And probably unrelated to this issue, but downloading logs via the submit debug logs crashed Element desktop twice for me.

@kegsay
Copy link

kegsay commented Jun 18, 2024

Thanks, this is all very valuable information. I would be very interested to know if any other tables are empty when this happens. Specifically:

  • matrix-react-sdk -> account
  • matrix-js-sdk:crypto -> account
  • Local Storage -> (any keys here)
  • Session Storage -> (any keys here)

@kegsay
Copy link

kegsay commented Jun 18, 2024

Worth noting that for Unable to load session invalid type: null, expected a sequence at line 1 column 4 we expect element-hq/element-web#27520 is the root cause (browser downgrades, electron downgrades). Some users who reported that their computer crashed see this as well.

The missing pickleKey (error Unable to load session invalid type: integer 123, expected a string at line 1 column 4) suggests things got nuked from the DB. Looking at Chrome Web Storage design docs, there is a clear difference between indexedDB and localStorage internally:

When an origin is evicted, all of its quota-managed data (Indexed DB, Cache API, etc) is deleted.

Screenshot 2024-06-18 at 15 04 40

This implies that indexedDB could be wiped but not local or session storage. This would put the app into an inconsistent state. This is at odds with MDN Eviction docs which states:

When an origin's data is evicted by the browser, all of its data, not parts of it, is deleted at the same time. If the origin had stored data by using IndexedDB and the Cache API for example, then both types of data are deleted.

Only deleting some of the origin's data could cause inconsistency problems.

If the app is in an inconsistent state I could see this causing some of these errors. Separately, why now. What is different with EW-R that is causing us to see this more frequently?

@KrisVos130
Copy link

Just experienced the issue again.

  • matrix-react-sdk -> account: mx_access_token and mx_refresh_token are both present, but the value is undefined.
  • matrix-js-sdk::matrix-sdk-crypto -> core -> account: account and private_identity are both present, and have values.
  • Local Storage has a lot of keys/values.
  • Session Storage has 2 keys/values.

Screenshot_20240627_061609

I took some screenshots of a few of the mentioned tables, so I can censor those and send those too if needed. And I made a backup of the .config/Electron , so could possibly answer additional questions regarding content of IndexedDB/Local Storage/Session Storage without waiting for the bug to happen (though I imagine parsing it won't be the easiest).

@kegsay
Copy link

kegsay commented Jun 27, 2024

Amazing, thank you! I'll look into this soon.

@kegsay
Copy link

kegsay commented Jun 27, 2024

Also seen:

  • Failed to read large IndexedDB value: twice from different users on Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.69 Chrome/124.0.6367.207 Electron/30.0.5 Safari/537.36 - which appears to be "Electron Application 30 on Linux". One report mentions that they had a hard crash. See https://issues.chromium.org/issues/342779913 ?

@kegsay
Copy link

kegsay commented Jun 28, 2024

It also looks like this may have been an issue for a while - element-hq/element-web#9338 seems eerily similar.

@kegsay
Copy link

kegsay commented Jul 8, 2024

My best guess on this is that hard crashes are corrupting the database. This has increased with rust SDK because the indexedDB implementation does not ask for durable writes because it can't.

I'm guessing that Electron is more affected because there's more opportunity for hard crashes - restarting a machine with chrome restores the tabs which I wonder if that behaves differently to closing an app in Electron.

@richvdh
Copy link
Member Author

richvdh commented Jul 28, 2024

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer 123, expected a string at line 1 column 4

I believe this is what will happen on Element Desktop, if the pickle key is not provided by the system keychain when Element starts. (When you first log in, a key is stored in the system keyring, which is then used to encrypt the E2EE database. If that key is removed from the system keychain, then you'll get this error.) Certainly that is what seems to have happened in @KrisVos130's report of 2024-06-11. The clue is this line in the logs:

2024-06-11T17:54:24.990Z E Expected a pickle key, but none provided.  Encryption may not work.

@richvdh
Copy link
Member Author

richvdh commented Jul 29, 2024

Also seen:

  • Failed to read large IndexedDB value: twice from different users on Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.69 Chrome/124.0.6367.207 Electron/30.0.5 Safari/537.36 - which appears to be "Electron Application 30 on Linux". One report mentions that they had a hard crash. See https://issues.chromium.org/issues/342779913 ?

Tracking this over at element-hq/element-web#27870 because I think it's different.

@richvdh
Copy link
Member Author

richvdh commented Jul 29, 2024

I strongly suspect that at least Unable to load session invalid type: integer `123`, expected a string is related to #1077. I think:

  • If the pickle key is missing altogether, we see the failure to restore the crypto db (Unable to load session invalid type)
  • If the pickle key is wrong, we see "Error decrypting secret access_token"

@t3chguy
Copy link
Member

t3chguy commented Jul 30, 2024

My best guess on this is that hard crashes are corrupting the database. This has increased with rust SDK because the indexedDB implementation does not ask for durable writes rustwasm/wasm-bindgen#3959.

This issue has been closed, can we now use durable writes?

@richvdh
Copy link
Member Author

richvdh commented Jul 30, 2024

Yes, I believe so. That's tracked over at matrix-org/matrix-rust-sdk#3354.

I'm a bit unconvinced that indexeddb corruption is the actual problem here.

@KrisVos130
Copy link

I have been using 1.11.68 for a month or so, and just downloaded 1.11.72 as it has some extra logging for this or similar issues. The first time I opened 1.11.72, I got the "Unable to restore session" screen.
I am not 100% sure if it is the same issue, or of it is related to updating, could be a coincidence.

The "Unable to load session" error is slightly different than the three listed in comment 1 of this issue, Unable to load session invalid type: sequence, expected a string at line 1 column 0. Previously, I think I only had the 123 one.

I also saw the following in journalctl:

element-desktop[6475]: Error getting the event index passphrase out of the secret store [Error: Can't find session /org/freedesktop/secrets/session/3]

I also saw something about the problem potentially being with the interaction with the wallet. So for some extra context, in the kdewallet, I can see passwords/keys for the sessions that have failed to load over the past few months. So if the issue is with the wallet, the password still exists at the very least. I also have had no issues with any other program so far.

@richvdh
Copy link
Member Author

richvdh commented Jul 31, 2024

It's becoming clear to me that these are three completely different root causes.

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer `123`, expected a string at line 1 column 4

This is what used to happen when the database was encrypted, but the picklekey was missing: the account is stored as an array of bytes containing the json serialization of an EncryptedValueBase64, which starts [123 (123 == {) but the code is expecting to find an object corresponding to the JSON serialization of a PickledAccount. (Apparently serde_json doesn't mind that it is an array rather than an object.)

As of v1.11.70 (which does a bit more checking on the values it is trying to deserialise, this may be replaced by simply Unable to load session Error: An object failed to be decrypted while unpickling. Either:

  • We have a new-style encrypted object, which is successfully deserialized into a MaybeEncrypted::Encrypted, but maybe_decrypt_value then spots that we have an encrypted value but no cipher, so returns the error, or:
  • We have an old-style encrypted object, consisting of an array starting [123 as above, which fails the same way as above.
2024-04-30T06:35:34.559Z E Unable to load session invalid type: map, expected a sequence at line 1 column 0

This is, probably, the opposite problem: the database is not encrypted, but we have a picklekey. The code is expecting to find an array of bytes, but in fact we have an unencrypted JSON serialization of a PickledAccount.

Again, this has changed as of v1.11.70: it will always report "An object failed to be decrypted while unpickling".

2024-04-30T00:39:25.404Z E Unable to load session invalid type: null, expected a sequence at line 1 column 4

This is a bit odd. Again, it happens when we do have a picklekey, and one thing that could cause it is when the stored "account" is just null, rather than an encrypted object. Once more, this will become "An object failed to be decrypted while unpickling" from v1.11.70.

It's rather unfortunate that everything will end up with the same error from v1.11.70 :/

@richvdh
Copy link
Member Author

richvdh commented Jul 31, 2024

@KrisVos130: thanks for the update. Yes, this does look like basically the same problem; v1.11.70 added a database migration step, and if that step runs when the encryption key is missing it will spit out that error.

@richvdh
Copy link
Member Author

richvdh commented Aug 6, 2024

I notice that this only seems to happen on the electron platform. I'll transfer it to the element-desktop repo.

@richvdh richvdh transferred this issue from element-hq/element-web Aug 6, 2024
@dosubot dosubot bot added A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely labels Aug 6, 2024
@richvdh
Copy link
Member Author

richvdh commented Aug 6, 2024

Also, I'm pretty sure it's related to #1077.

@dbkr dbkr removed the S-Major Severely degrades major functionality or product features, with no satisfactory workaround label Aug 7, 2024
@richvdh
Copy link
Member Author

richvdh commented Aug 22, 2024

I think my next step here is going to be to add some sanity-checking to the bit where we try to load the access token from the database. Currently I think that is failing, but the error is being silently swallowed, and then we get this more cryptic error from the crypto stack. At least if we fail earlier we can give better error messages in the logs.

@richvdh richvdh changed the title "Unable to restore session" due to failure loading OlmAccount from indexedDB "Unable to restore session" due to lost pickle key Nov 17, 2024
@richvdh
Copy link
Member Author

richvdh commented Nov 17, 2024

[Updated the description to reflect the fact that there are now different symptoms when this happens]

@richvdh richvdh removed O-Uncommon Most users are unlikely to come across this or unexpected workflow A-E2EE labels Nov 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Storage Storage layer of the app, including IndexedDB, local storage, etc. O-Occasional Affects or can be seen by some users regularly or most users rarely S-Critical Prevents work, causes data loss and/or has no workaround T-Defect
Projects
None yet
Development

No branches or pull requests

5 participants