Insistent notifications for urgent actionable alerts #158

darkk · 2017-09-10T11:13:59Z

Stuff happens: #128, #157. Non-actionable stuff happens as well: #155. @hellais and I think that insistent notifications would be valuable.

Basically, there are two options for insistent notifications: a separate app that annoys you until you hit [ACK], or a VoIP call from a robot that annoys you until you pick up the phone.

It seems our options are PagerDuty, OpsGenie, VictorOps and AlertOps if we go SaaS (three of them mention discounts for non-profits). Cabot with OpenDuty may be a good enough self-hosted solution that has no apps and uses the phone for insistent notifications. We can also go NIH and glue a Prometheus webhook directly to a SIP dialer, dropping the escalation requirement :)

Also, WMF has a nice page: https://wikitech.wikimedia.org/wiki/Monitoring_package_survey#OpsView
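
For illustration, the NIH option (webhook glued to a dialer, no escalation) could look roughly like the sketch below. It assumes alerts arrive in the Alertmanager webhook JSON format (a top-level `alerts` list whose entries carry `status` and `labels`). The `severity` label, its `critical` value, and the `dial` callback standing in for a SIP dialer are hypothetical names chosen for the example, not anything deployed here.

```python
import json
from typing import Callable, List


def urgent_alerts(payload: dict, severity: str = "critical") -> List[dict]:
    """Return firing alerts whose `severity` label matches.

    The `severity` label name is an assumption; use whatever label
    the alerting rules actually set.
    """
    return [
        alert
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("severity") == severity
    ]


def handle_webhook(body: bytes, dial: Callable[[str], None]) -> int:
    """Parse one webhook request body and call `dial` once per urgent alert.

    `dial` is a placeholder for the SIP dialer; it receives the alert name.
    Returns the number of calls requested.
    """
    payload = json.loads(body)
    alerts = urgent_alerts(payload)
    for alert in alerts:
        name = alert.get("labels", {}).get("alertname", "unknown alert")
        dial(name)
    return len(alerts)
```

Resolved alerts and alerts with a lower severity are ignored, so only urgent, still-firing alerts would trigger a call.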

SuperQ · 2017-09-11T04:23:05Z

I've used PagerDuty, and didn't hate it. Their mobile app has good push notifications, and they also work well over other communication methods.

I've also heard good things about VictorOps, but have no first-hand experience.

darkk · 2018-08-30T11:54:23Z

This was discussed once again in Rome, and we decided that alerting to Slack during "business hours" is enough. Availability incidents should rather have their root cause fixed, and going down for ~12 hours is something we can afford.

The only system that is critical enough to wake people up is the system doing data ingestion. The plan is to improve its availability in two ways:

1. make it fault-tolerant against availability zone failure: replicate data spool with MongoDB and use Amazon Route53 with health checks in active-active mode to do failover
2. make it non-critical: OONI Probe should be able to work in "offline" mode to collect statistics during "Internet blackouts": it should re-upload collected data on failure, should not depend on online GeoIP services and so on.
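
The "re-upload collected data on failure" part of the second item could be sketched as a local spool that is flushed whenever the network is available. This is only an illustration under assumptions: the spool directory layout, the `.report` suffix, and the `upload` callback are hypothetical and do not describe how OONI Probe actually stores or submits measurements.

```python
from pathlib import Path
from typing import Callable


def flush_spool(spool_dir: Path, upload: Callable[[bytes], None]) -> int:
    """Try to upload every spooled report and return how many were sent.

    A report file is deleted only after `upload` returns without raising,
    so reports collected during a blackout survive until a later attempt.
    """
    sent = 0
    for path in sorted(spool_dir.glob("*.report")):
        try:
            upload(path.read_bytes())
        except OSError:
            # Covers ConnectionError and timeouts: keep the file, retry later.
            continue
        path.unlink()
        sent += 1
    return sent
```

Calling `flush_spool` periodically (or when connectivity returns) makes a collector outage delay the data rather than lose it, which is what lets the ingestion system be treated as less critical.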

Test helpers are stateless, so their availability may be trivially improved.

So I'm closing the issue as we plan no further actions regarding insistent notifications.

darkk closed this as completed Aug 30, 2018

This was referenced Sep 3, 2018

Separate critical alerts from non-critical alerts #182

Closed

Monitoring epic, Oct 2017 … Sep 2018 #226

Closed

darkk mentioned this issue Apr 29, 2019

b.collector down for 8.5 hours #157

Closed

9 tasks
