Site Reliability Engineering

Site Reliability Engineering (SRE) is a set of principles and practices that supports software delivery: keeping production systems stable while still delivering new features at speed.

About Site Reliability Engineering

Ben Treynor, Google’s mastermind behind Site Reliability Engineering, describes it as “what happens when you ask a software engineer to design an operations function.” SRE is a software engineering approach to IT operations. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks.

SRE takes tasks that operations teams have historically done, often manually, and gives them to engineers or ops teams who use software and automation to solve problems and manage production systems. SRE helps teams determine which new features can be launched, and when, by using service-level agreements (SLAs) to define the required reliability of the system, expressed through service-level indicators (SLIs) and service-level objectives (SLOs).
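As a rough illustration of how an SLI and an SLO can inform a launch decision, the sketch below computes a request-based SLI and the share of the resulting error budget already consumed. The names (SloReport, evaluate_slo), the sample numbers, and the rule "launch only while budget remains" are assumptions made for this example, not a prescribed SRE method.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    budget_consumed: float  # share of the error budget used (1.0 = all of it)
    can_launch: bool        # hypothetical policy: launch only while budget remains


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a measured SLI against an SLO target (both are fractions, e.g. 0.999)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    error_budget = 1 - slo_target   # fraction of requests allowed to fail
    error_rate = 1 - sli            # fraction of requests that actually failed
    budget_consumed = error_rate / error_budget

    return SloReport(sli=sli, budget_consumed=budget_consumed,
                     can_launch=budget_consumed < 1.0)


if __name__ == "__main__":
    # Assumed sample: 1,000,000 requests in the window, 500 of them failed.
    report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo_target=0.999)
    print(f"SLI={report.sli:.4%} budget used={report.budget_consumed:.0%} "
          f"launch={report.can_launch}")
```

In this sketch, a service that has used about half of its error budget is still allowed to launch, while one that has exceeded the budget is not. Real policies are usually agreed between product and SRE teams and may be more nuanced.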

Automating or eliminating anything repetitive, where doing so yields a return on investment.

Designing systems to reduce risks to availability, latency, and efficiency.

Pursuing the target reliability that we require, nothing more and nothing less. Defining what is necessary is a practice in itself.

Observability: the ability to ask arbitrary questions about the system.
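To make "the target reliability that we require" concrete, an availability target can be translated into the downtime it permits over a period. The following is a minimal sketch; the function name allowed_downtime and the 30-day window are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much downtime a given availability target permits within a window."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in (0, 1]")
    # Whatever fraction of the window is not required to be available may be downtime.
    return window * (1 - availability_target)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} over 30 days allows {allowed_downtime(target)} of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of permitted downtime, and each additional "nine" cuts that allowance by a factor of ten. This is one reason to aim for the reliability that is actually required rather than for the maximum achievable.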

Site Reliability Engineering Practices

Monitoring for Observability

Incident Response: Being On-Call

Postmortem and Root-Cause Analysis

Testing + Release Procedures

Large Scale System Design: Focus on Reliability

Capacity Planning

Chaos Engineering

DevOps vs SRE

“The term DevOps emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.”

*Reference: Site Reliability Engineering: How Google Runs Production Systems (book), 2018 Award Winner