adriannovegil/awesome-sre

Awesome SRE

You want your computer systems to run well, and the subjective definition of what well means depends on the nature of the system and your goals regarding it.

Most of the time, the primary motivation for companies is to create profit for the owners and shareholders.

The definition of running well therefore derives from the objectives of the business model.

"Hope is not a strategy."

Contents

1. Site Reliability Engineering

2. SRE Culture

3. DevOps

4. Monitoring and Observability

5. Alerting

6. Incident Response and Post-Mortem

  • A collection of post-mortems
  • A collection of postmortem templates
  • Our incident postmortem template - Hosted Graphite postmortem template.
  • Postmortem exercise
  • Squadcast - Experience the journey from On-Call to SRE.
  • PagerDuty - Your platform for digital operations management.
  • VictorOps - VictorOps is now Splunk On-Call.
  • Splunk On-Call - Helps developers, DevOps, and operations teams make on-call suck less while reducing mean time to acknowledge and resolve outages.
  • OpsGenie - On-call and alert management to keep services always on.
  • AlertOps - Transform real-time operational intelligence into automated incident response.
  • Blameless - The Blameless SRE Platform supports engineering and DevOps teams through incidents and retrospectives, and helps them detect interesting patterns, given the right data.
  • OnPage - Incident alert management system with a secure smartphone app, enabling response teams to get the most out of their digital technology investments.
  • PagerTree - Intelligent alert routing for the modern team.
  • Cabot - Get alerted when services go down or metrics go crazy.
  • xMatters - Automate operations workflows, ensure applications are always working, and deliver remarkable products at scale with the xMatters service reliability platform.
  • Derdack Enterprise Alert - Enterprise Alert Notification Software.
  • BigPanda - AIOps event correlation and automation platform that enables Tech Ops teams to keep the digital economy running.
  • OpenDuty - An incident escalation tool similar to PagerDuty (no longer maintained).
  • ngDesk - ngDesk includes support, sales, asset management, marketing and pager in an all-in-one application that is ready to go and easy to use.
  • Geneos - Real-time monitoring for all your environments in one platform.
  • FireHydrant - Gives teams the tools to maintain service catalogs, respond to incidents, communicate through status pages, and learn with retrospectives.
  • Rootly - The fastest way to declare an incident.

7. On-Call

8. Chaos Engineering

9. Automation

10. Performance

11. Tools

  • SLO Generator - Tool to compute and export Service Level Objectives (SLOs), Error Budgets and Burn Rates, using configurations written in YAML (or JSON) format.
  • SLO Computer - SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy.
  • SLO Tracker - A simple but effective way to track SLOs and error budgets. SLO Tracker can be integrated with a few alerting tools via webhooks to receive SLO-violating incidents.
  • SLO exporter - Computes standardized Service Level Indicator (SLI) and Service Level Objective (SLO) metrics based on events coming from various data sources.
  • Pyrra - Making SLOs with Prometheus manageable, accessible, and easy to use for everyone.
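
The tools above revolve around the same basic arithmetic: an SLO target implies an error budget, and the burn rate says how fast that budget is being spent. The following is a minimal sketch of that arithmetic only. It does not reproduce the implementation or API of any tool listed here, and the function names (`error_budget`, `burn_rate`, `budget_remaining`) are hypothetical, chosen for illustration. It assumes an event-based SLI (good events over total events) measured over the same window as the SLO.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail under the SLO (about 0.001 for 99.9%)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being consumed exactly as fast as
    the SLO allows; values above 1.0 mean it is being consumed faster.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    error_rate = 1.0 - good_events / total_events
    return error_rate / error_budget(slo_target)


def budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget left for the window (negative if overspent)."""
    return 1.0 - burn_rate(good_events, total_events, slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 requests, 50 failures, 99.9% SLO.
    print(burn_rate(9_950, 10_000, 0.999))
    print(budget_remaining(9_950, 10_000, 0.999))
```

With the hypothetical numbers in the usage example, the observed error rate (0.5%) is about five times the budget (0.1%), so the burn rate is roughly 5 and the remaining budget is negative. Real tools typically evaluate this over several windows at once before alerting.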

12. Books

13. References

14. License

CC0

15. Contributing

Contributions welcome! Read the contribution guidelines first.

Thank you!
