Site Reliability Engineering (SRE) is a practice that software engineering and IT teams use to proactively build and maintain more reliable services. It applies software development solutions to IT operations problems. From monitoring to software delivery to incident response, SRE engineers focus on building and monitoring anything in production that improves service resiliency without slowing development.
When a team first sets out to refine its monitoring, knowing where to start can be tricky. The four golden signals of SRE are widely used across teams and are a good place to begin, because they establish the core metrics that a business should always be tracking.
The following four golden signals are the basic, essential building blocks for any effective monitoring strategy:
1. Latency- Time taken to serve a request
The team first has to define a benchmark for acceptable latency. Then they have to monitor the latency of successful requests separately from the latency of failed requests and compare the two, because a request that fails quickly or slowly can distort the overall picture. Tracking latency across the whole system helps identify which services are underperforming and lets teams detect problems sooner.
2. Traffic- Stress from demand on the system
The definition of traffic depends on the business; for a web service it is often requests per second. By monitoring real-user interactions and traffic in the application, SRE teams can see how customers experience the product. They can also see how the system holds up under changes in demand or under stress.
3. Errors- Rate of requests that are failing
SRE teams need to monitor the rate of errors across the entire system as well as at the individual service level. Errors may be explicit, such as failed HTTP requests, or defined by manually written logic, and both kinds need to be tracked. The error rate shows the true health of the service as the customer sees it and helps teams act faster on frequent errors.
4. Saturation- Overall capacity of the service
Teams need to monitor the utilization of their system. Most systems begin to degrade before usage hits 100%. SRE teams need to determine a benchmark for a healthy percentage of usage.
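As a rough illustration of the four signals above, the following Python sketch derives all of them from a small batch of request records. Everything in it is an assumption made for the example: the `Request` record, the `golden_signals` function, the choice of HTTP 5xx as the definition of a failed request, the median as the latency summary, and CPU usage as the saturated resource. A real system would normally get these values from its monitoring tooling.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarize the four golden signals for one observation window.

    Assumption: a request counts as failed when its status is 5xx.
    """
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    total = len(requests)
    return {
        # Latency: tracked separately for successful and failed requests.
        "latency_ok_ms": median(ok) if ok else None,
        "latency_failed_ms": median(failed) if failed else None,
        # Traffic: demand on the system, here requests per second.
        "traffic_rps": total / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: how full the chosen resource is (CPU in this sketch).
        "saturation": cpu_used / cpu_capacity,
    }


# Made-up sample data: five requests seen in a 10-second window.
sample = [
    Request(120.0, 200),
    Request(80.0, 200),
    Request(100.0, 200),
    Request(30.0, 500),
    Request(50.0, 503),
]
signals = golden_signals(sample, window_s=10.0, cpu_used=6.0, cpu_capacity=8.0)
print(signals)
```

Keeping the latency of failed requests in its own field reflects the point made under Latency: here the failed requests are much faster than the successful ones, so mixing them into a single figure would make the service look quicker than it is.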
The four golden signals of SRE create a baseline layer of visibility into reliability, which makes them a strong starting point for monitoring the health of a system. Once teams have this base-level monitoring in place, they can continue to improve system visibility from there.