title	date	resolved	resolvedWhen	section	severity
3/4 of SLO1 Error Budget consumed in 6 weeks	2022-09-01T01:00:00+02:00	true	2022-10-12T01:00:00+02:00	issue	notice

(Mostly for our internal use since the source data are not publicly available.)

In the 6 weeks from the beginning of September, the SLO1 error budget dropped to 25%. In mid-October the trend reversed, and now, at the end of October, the budget is back at 50%.
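To put the drop in perspective, the arithmetic behind "3/4 consumed in 6 weeks" can be sketched as below. This is only a rough illustration: it assumes the budget was consumed linearly and ignores any replenishment from a rolling evaluation window, and the function names are made up for this example.

```python
def weekly_burn(consumed_fraction: float, weeks: float) -> float:
    """Average fraction of the error budget consumed per week."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return consumed_fraction / weeks


def weeks_until_exhausted(remaining_fraction: float, burn_per_week: float) -> float:
    """Weeks left until the budget reaches 0% at a constant burn rate."""
    if burn_per_week <= 0:
        return float("inf")  # budget is not being consumed
    return remaining_fraction / burn_per_week


# Figures from this post: 75% of the budget consumed in 6 weeks, 25% left.
burn = weekly_burn(consumed_fraction=0.75, weeks=6)
weeks_left = weeks_until_exhausted(remaining_fraction=0.25, burn_per_week=burn)

print(f"average burn: {burn:.1%} of the budget per week")
print(f"weeks until exhaustion at that rate: {weeks_left:.1f}")
```

At the September rate the remaining 25% would have lasted roughly two more weeks, which is why the reversal in mid-October matters.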

The metrics in our (non-public) Grafana (Packit boards -> (Prod/Stg) Accepted status time) show that the average "accepted status time" did increase by approximately 1 second at the beginning of September, and that cases of >15s started to appear, which is what began consuming the error budget.
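The relation between slow cases and the budget can be sketched roughly as follows. The 15-second threshold comes from the text above; the 99% target, the function name, and the sample durations are assumptions made purely for illustration and are not real Packit data.

```python
THRESHOLD_SECONDS = 15.0  # from the text: cases of >15s consume the budget
SLO_TARGET = 0.99         # ASSUMPTION: the actual SLO1 target is not stated here


def budget_remaining(durations, target=SLO_TARGET, threshold=THRESHOLD_SECONDS):
    """Return the fraction of the error budget still left (negative if overspent)."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if not durations:
        return 1.0  # no events, nothing consumed
    bad = sum(1 for d in durations if d > threshold)
    allowed_bad = len(durations) * (1 - target)  # slow events the SLO tolerates
    return 1 - bad / allowed_bad


# Hypothetical sample: 400 events, 3 of them slower than 15 s.
samples = [2.0] * 397 + [16.0] * 3
remaining = budget_remaining(samples)
print(f"error budget remaining: {remaining:.0%}")
```

Note that a small rise in the average (about 1 second here) consumes nothing by itself; only the events crossing the threshold count, so the average and the budget can move independently.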

The cause is still unknown, but the changes we made in the last week of August, and which could therefore have contributed to the problem, were:

- Lots of SQLAlchemy-related changes: contribution unknown.
- Switched from 4 single-threaded workers to 1 multithreaded worker: there might be some correlation, but it's not clear. For example, the change was reverted for the first 2 weeks of October (i.e. the 4 single-threaded workers were back), and even though the average value seemed to drop a bit, the error budget continued to be consumed (meaning there were still some cases of >15s).

We need to keep watching the metrics, experiment with the workers (their number and type), and give each change more time (at least 2 weeks) before we can tell whether it changes the trend.

packit/packit-service#1364
packit/packit-service#1677
packit/packit-service#1728
