Insistent notifications for urgent actionable alerts #158

Closed
darkk opened this issue Sep 10, 2017 · 2 comments

Comments

@darkk
Contributor

darkk commented Sep 10, 2017

Stuff happens: #128, #157. Non-actionable stuff happens as well: #155. @hellais and I think that insistent notifications would be valuable.

Basically, there are two options for insistent notifications: a separate app that annoys you until you hit [ACK], or a VoIP call from a robot (which annoys you until you pick up the phone).

It seems our options are PagerDuty, OpsGenie, VictorOps and AlertOps if we go SaaS (three of them mention discounts for non-profits). Cabot with OpenDuty may be a good enough self-hosted solution: it has no apps and uses phone calls for insistent notifications. We can also go NIH and glue a Prometheus webhook directly to a SIP dialer, dropping the escalation requirement :)
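
To make the NIH option concrete, here is a minimal sketch of the receiving side, assuming the alerts arrive through an Alertmanager webhook receiver (a JSON body with an `alerts` list, each alert carrying `status` and `labels`). The `severity: page` label, the `place_call` stub and the port are hypothetical placeholders; nothing here reflects an actual project setup, and a real dialer would replace the `print`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: urgent alerts carry a `severity: page` label. The label name
# and value are illustrative, not taken from any real alerting rules.
URGENT_LABEL = ("severity", "page")


def urgent_firing_alerts(payload):
    """Return the alerts in a webhook payload that are both firing and urgent."""
    key, value = URGENT_LABEL
    return [
        alert
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get(key) == value
    ]


def place_call(message):
    """Hypothetical stand-in for a SIP dialer; a real one would ring a phone."""
    print("CALL:", message)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted by the alerting system.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alert in urgent_firing_alerts(payload):
            name = alert.get("labels", {}).get("alertname", "unknown alert")
            place_call(f"{name} is firing")
        self.send_response(200)
        self.end_headers()


def serve(port=9099):
    """Run the receiver on localhost; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Note that this drops acknowledgement and escalation entirely, as the comment above suggests: every urgent firing alert in a delivery triggers one call attempt.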

Also, WMF has a nice page: https://wikitech.wikimedia.org/wiki/Monitoring_package_survey#OpsView

@SuperQ
Contributor

SuperQ commented Sep 11, 2017

I've used PagerDuty and didn't hate it. Their mobile app has good push notifications, and their alerts also work well over other communication methods.

I've also heard good things about VictorOps, but I have no first-hand experience.

@darkk
Contributor Author

darkk commented Aug 30, 2018

This was discussed once again in Rome, and we decided that alerting to Slack during "business hours" is enough. Availability incidents should rather have their root cause fixed, and going down for ~12 hours is something we can afford.

The only system that is critical enough to wake people up is the system doing data ingestion. The plan is to improve its availability in two ways:

  1. make it fault-tolerant against availability zone failure: replicate the data spool with MongoDB and use Amazon Route53 with health checks in active-active mode to do failover
  2. make it non-critical: OONI Probe should be able to work in "offline" mode to collect statistics during "Internet blackouts". It should re-upload collected data on failure, should not depend on online GeoIP services, and so on.
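
The "re-upload collected data on failure" part of the second item could look roughly like the sketch below: reports are spooled to local disk first and deleted only after a successful upload. This is an illustration under stated assumptions, not OONI Probe's actual code. `spool_report`, `flush_spool` and the `upload` callable are hypothetical names, and a network failure is assumed to surface as an `OSError`.

```python
import json
import os
import uuid


def spool_report(spool_dir, report):
    """Write a report to the local spool so it survives an upload failure."""
    os.makedirs(spool_dir, exist_ok=True)
    path = os.path.join(spool_dir, f"{uuid.uuid4().hex}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh)
    return path


def flush_spool(spool_dir, upload):
    """Try to upload every spooled report and keep the ones that fail.

    `upload` is a hypothetical callable that raises OSError when the
    network is unavailable. Returns the number of reports uploaded.
    """
    if not os.path.isdir(spool_dir):
        return 0
    uploaded = 0
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as fh:
            report = json.load(fh)
        try:
            upload(report)
        except OSError:
            continue  # still offline for this report; retry on the next flush
        os.remove(path)  # delete only after the upload succeeded
        uploaded += 1
    return uploaded
```

Calling `flush_spool` periodically, or whenever connectivity returns, means a collector outage delays data rather than losing it, which is what makes the ingestion system less critical.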

Test helpers are stateless, so their availability may be trivially improved.

So I'm closing the issue as we plan no further actions regarding insistent notifications.
