Resilience with bugs in production #4830

Open
monsieurtanuki opened this issue Nov 20, 2023 · 7 comments

@monsieurtanuki
Contributor

A bug in production (#4807) was detected 4 days ago, and we managed to release a fix.
When the bug happened on a device, it was systematic and prevented the user from entering the app.
Thank you @g123k for finding the faulty PR, and thanks to the other users/developers/managers for making the whole process that quick!

The good thing is that we were relatively quick to release a fix (4 days from bug detection).
But there are points where we should improve:

  1. it was not an easy bug to detect
    1. it depended on the device (it didn't happen on my smartphone when I re-tested yesterday)
    2. it happened only for app upgrades, not for app installs from scratch
    3. it happened only under some conditions, such as when products had been inserted (not an unusual situation, but we have to test it)
  2. the release process made it hard (or impossible) to find which PR was problematic, as AFAIK there's no way to know which PRs are in which release. Actually there is a way, but it's not trustworthy, as this page does not reflect all the changes.
  3. on some devices (e.g. @teolemon's) the build number is not accessible through the Android system (cf. "My version of Android hides build number"), so if the app crashes at init we're stuck
  4. the bug itself was due to hive (with my help!), whose latest (last?) refresh was 16 months ago. I could create an issue on hive but it wouldn't be processed as hive is sleeping while its developers work on another database.
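
Regarding point 2, here is a minimal sketch of how the PRs between two release tags could be listed straight from the git history. It assumes that merged PRs leave either a "Merge pull request #N" subject or a squash-merge suffix like "(#N)"; the function names are hypothetical and this is not how our releases are generated today.

```python
import re
import subprocess

# Assumption: PRs show up in commit subjects either as
# "Merge pull request #123 ..." or as a squash-merge suffix "... (#123)".
_PR_PATTERN = re.compile(r"Merge pull request #(\d+)|\(#(\d+)\)\s*$")


def pr_numbers_from_subjects(subjects):
    """Return the PR numbers referenced by commit subjects, in first-seen order."""
    seen = []
    for subject in subjects:
        match = _PR_PATTERN.search(subject)
        if match is None:
            continue  # commit not tied to a PR (or using another convention)
        number = int(match.group(1) or match.group(2))
        if number not in seen:
            seen.append(number)
    return seen


def subjects_between(old_tag, new_tag, repo_dir="."):
    """Commit subjects reachable from new_tag but not from old_tag."""
    output = subprocess.run(
        ["git", "log", "--pretty=format:%s", f"{old_tag}..{new_tag}"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return output.splitlines()
```

Calling `pr_numbers_from_subjects(subjects_between(old_tag, new_tag))` with two release tags would then give the candidate PRs to bisect, provided every release is tagged.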
@teolemon
Member

I would add that we don't generate internal releases as often as we used to. I would strongly advocate automatically generating one internal build a day (or no build if there is no commit), so that the 20 people in there can spot issues early.
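
As a rough sketch of the "no build if no commit" rule, a scheduled job could compare the branch head with the last commit that was built. The state file and the function names are hypothetical, and the actual trigger (e.g. a scheduled CI job) is left out.

```python
from pathlib import Path


def should_build(head_sha, last_built_sha):
    """Build only when the branch head moved since the last internal build."""
    return bool(head_sha) and head_sha != last_built_sha


def read_last_built(state_file):
    """Return the SHA recorded by the previous build, or None on the first run."""
    path = Path(state_file)
    return path.read_text().strip() if path.exists() else None


def record_build(state_file, head_sha):
    """Remember which commit was just built, for the next daily run."""
    Path(state_file).write_text(head_sha + "\n")
```

The daily job would call `should_build(head, read_last_built(state_file))`, run the build when it returns `True`, then call `record_build`.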

@monsieurtanuki
Contributor Author

More frequent internal releases, why not?
"Automatically" I'm not convinced, as there are some PRs that need to be merged together (e.g. my work on background tasks last year or @g123k's work on the next app redesign).
I suggest rather weekly, manual and on Monday.
Besides, we should double-check if "internal releases" do mean "sandbox", as yesterday I think it meant a bug in production.

@teolemon
Member

  • The point of failure was the manual build, and the manual triggering. Thus my emphasis on automatic triggering.
  • Or we do ad-hoc distribution (but can we keep the package name?)

@g123k
Collaborator

g123k commented Nov 20, 2023

Here are my 2 cents.

Honestly, the bug was difficult to detect, and I could have done it too (e.g.: with the migration process). So please don't take this as a failure, we will simply learn from our mistakes.

The problem
I'm clearly not satisfied with how we release new app versions.
First, it's totally opaque: we don't know when a release happens or why it was released.
For me, we're at a point of releasing versions for the sake of releasing versions.

I know that the release process I've proposed is annoying. But for me, this process ONLY goes hand in hand with the release of new features/changes.

But that wasn't the case here!
As far as minor releases like this one are concerned, there's no need for such a process.
Instead, I'd be inclined to say that it should be communicated, with a 1-day lead time, for example. That gives everyone time to check that everything is working and to give the go-ahead.

The problem with this version lies in the fact that there has been no such communication.
The reason why I proposed the release process was precisely because this isn't the first time this has happened to us this year. I'm sorry to say it like this, but it's unacceptable: we're being made to look like fools.

Another issue
Another point I've already raised is the lack of git flow. I don't necessarily mean scrupulously respecting the doctrine, but today our dev branch is not necessarily releasable. Many PRs are split, and the work is often not completed.

I'm taking the liberty of putting the subject back on the table, because NO, additional tests won't prevent this.

The real issue
Development on Hive has slowed down, but I don't think a more recent version would have changed things. As far as I'm concerned, if Hive needs this adapter, it needs it.
We can migrate to Isar, but apart from getting into trouble, I don't see any real, serious reason to change in the short term.

@monsieurtanuki
Contributor Author

> The point of failure was the manual build, and the manual triggering. Thus my emphasis on automatic triggering.

I know that for off-dart it's just about pressing a button (merging the "release" issue), so that anyone can do it.
Isn't that the case for Smoothie?

> Or we do ad-hoc distribution (but can we keep the package name)

I guess making an APK file available somewhere would be possible, only for advanced or interested users, who would download it manually.

> So please don't take this as a failure, we will simply learn from our mistakes.

It's an opportunity to review and improve our processes.

> I'm clearly not satisfied with how we release new app versions.

👍

> As far as I'm concerned, if Hive needs this adapter, it needs it.

I'm afraid not, because we don't use / init that hive box anymore. If hive needs you to start at 0 and to populate all the id values to the max id value, it should state so explicitly and it does not.

> We can migrate to Isar, but apart from getting into trouble, I don't see any real, serious reason to change in the short term.

Agreed 6 months ago, cf. #3967.
My point here was to say that here we rely on a package that won't get fixed if needed. Your fix suggestion was good enough, but if ever we have an urgent need for a bug fix, that won't happen at all with hive, and we would be in trouble.
Perhaps the migration to another "database" would make sense in the long run, with a low priority.
And the funny part is that hive, the most important package (with sqflite), is the only one we don't get flooded with dependency bumps about!

@teolemon
Member

teolemon commented Nov 20, 2023

Additional list of action items

  • Add the build number in the version number (will probably break some CI, but needed since Android now hides build numbers)
  • Clean Android internal testers (only committed testers ready for havoc)
  • Clean TestFlight testers (only committed testers ready for havoc)
  • Clarify the name of the internal TestFlight list (to avoid spamming casual beta testers)
  • GitHub Action: Send to the correct internal TestFlight list (and not to casual testers; this is probably already the current behaviour)
  • GitHub Action: Generate a build every day if there is a commit (please see point 1)
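
For the first item, a minimal sketch of folding the build number into the visible version. It assumes the pubspec-style `name+build` form as input; the dotted output format and the function names are only illustrations, not a decided convention.

```python
def split_pubspec_version(version):
    """Split 'name+build' into (name, build); build is None when absent."""
    name, separator, build = version.partition("+")
    # int() raises ValueError if the part after '+' is not a plain number.
    return name, (int(build) if separator else None)


def visible_version(version):
    """Append the build number as an extra dotted component (assumed format)."""
    name, build = split_pubspec_version(version)
    return name if build is None else f"{name}.{build}"
```

With this, a user whose Android settings hide the build number could still report it from the version string alone.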

@monsieurtanuki
Contributor Author

@teolemon Not sure if it is what you needed but you can already access the build number from FAQ / About:
[Image: Screenshot 2024-01-29 at 09 21 13]

Status: 💬 To discuss and validate