ARE YOU SETTING ALERTS BASED ON BUSINESS NEEDS?

I want to start this blog with a real-world problem faced by one of our content delivery customers. Let’s call this customer “XYZ”. XYZ runs 10-node Kubernetes clusters in each of its regions, and it has about four regions spread across different geographies. To monitor these clusters, XYZ uses three different tools: one open source (Prometheus), one from a popular vendor, and one from a startup. Together these tools produce about 100 alerts per week. Most of them relate to transient problems and become meaningless as soon as the problem goes away. As this situation persisted, XYZ stopped reacting to alerts. The team acted if and only if all three tools reported the same problem; every other alert simply sat in a spreadsheet and Jira with no real value.

Does this situation sound familiar to you as well?

The solution is to have an alerting strategy rather than blindly reacting to every alert or not reacting at all. Google proposes alerting based on error budget burn rate, but that requires an understanding of SLOs and error budgets. We have covered SLOs and error budgets in earlier blogs, but here is a brief introduction.

What are SLOs and SLAs?  

A service level objective (SLO) is a target, agreed internally between a service provider and its users, for how available a service should be. SLOs are the goals we set for the availability we expect from a system, while service level agreements (SLAs) are the legal contracts that spell out what happens if the system doesn’t meet those goals.

What are Error Budgets?

Think of an error budget as a balance in our pocket: the amount of risk we can take in a given period. This balance exists because we do not set SLOs at a straight 100%, which would bring high operational costs and little to no benefit from the added reliability.

Error budgets are defined by the service level objective (SLO) itself: they state how unavailable a particular service is allowed to be. For example, if the SLO target is 99.9% over a year, the error budget is 100% - 99.9% = 0.1%.
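
As a quick check of this arithmetic, here is a minimal Python sketch. The function names, and the choice to also express the budget as minutes of downtime, are illustrative assumptions rather than part of any particular tool.

```python
def error_budget_fraction(slo: float) -> float:
    """Fraction of the period the service may be unavailable, e.g. 0.999 -> 0.001."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1 - slo


def error_budget_minutes(slo: float, period_days: float) -> float:
    """The same budget expressed as minutes of allowed downtime in the period."""
    minutes_in_period = period_days * 24 * 60
    return error_budget_fraction(slo) * minutes_in_period


if __name__ == "__main__":
    slo = 0.999  # the 99.9% target from the example above
    print(f"Error budget: {error_budget_fraction(slo):.1%}")
    print(f"Allowed downtime over 365 days: {error_budget_minutes(slo, 365):.1f} minutes")
```

For a 99.9% SLO over a year this works out to about 525.6 minutes, or roughly 8.8 hours, of allowed unavailability.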

What is the Error Budget Burn Rate?

If the organisation has set the SLO at 99.9%, the service may fail 0.1% of the time before the SLO is missed. The burn rate tells us how fast the error budget is being consumed and whether we are on course to miss our SLO target.

If an organisation is serious about improving reliability, defining the error budget alone does not help; the burn rate matters just as much. It can be used to set up alerting with good precision and detection time, so that an alert is sent when the SLO is at risk of being missed.

These early alerts help the concerned team fix issues before any serious problem arises and stay within the SLO. A burn rate of less than or equal to 1 is not a problem, as the team will still meet its SLO target.

Let’s look at what different burn rate values mean.

If the burn rate is equal to 1, that means the error budget will be exhausted at the completion of the SLO period.

If the burn rate is less than 1, it signifies that there will be a remainder of the error budget at the end of the SLO period.

It’s important to understand when the problem really arises.

If the burn rate is greater than 1, the SLO is in danger and immediate corrective action is needed.
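
The three cases above can be sketched in Python. This sketch assumes the burn rate is measured as the observed error ratio divided by the error ratio the SLO allows (the definition used in Google’s SRE Workbook), and a 30-day SLO period. The function names and the sample error ratios are hypothetical.

```python
import math


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo)


def hours_until_budget_exhausted(rate: float, period_hours: float = 720) -> float:
    """How long a full budget lasts if this burn rate is sustained."""
    return math.inf if rate <= 0 else period_hours / rate


def interpret(rate: float) -> str:
    # isclose guards against floating-point noise around exactly 1.
    if math.isclose(rate, 1.0):
        return "budget is used up exactly at the end of the SLO period"
    if rate < 1:
        return "some budget is left at the end of the SLO period"
    return "SLO is in danger, act now"


if __name__ == "__main__":
    for error_ratio in (0.0005, 0.001, 0.01):  # hypothetical observed error ratios
        rate = burn_rate(error_ratio, slo=0.999)
        days = hours_until_budget_exhausted(rate) / 24
        print(f"burn rate {rate:.1f}: {interpret(rate)} (budget lasts {days:.0f} days)")
```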

How do we calculate the error budget burn rate?

Error Budget Burn Rate = budget consumed * period / alerting window

When alerting on the error budget burn rate, you decide when to get notified: for example, when 1% of your error budget has been used up in the last hour, or when 2% has. It depends on your organisation’s needs and the type of SLO.

As a starting point, Google recommends getting notified when 2% of the error budget has been consumed in 1 hour.
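
To see what such notification thresholds mean as burn rates, the formula above can be sketched in Python. The function name is hypothetical, and the 30-day SLO period is an assumption.

```python
def burn_rate_threshold(budget_consumed: float, period_hours: float, window_hours: float) -> float:
    """Burn rate at which `budget_consumed` of the budget is spent within the window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return budget_consumed * period_hours / window_hours


if __name__ == "__main__":
    period_hours = 30 * 24  # assumed 30-day SLO period
    for consumed in (0.01, 0.02):  # the 1% and 2% per hour options mentioned above
        rate = burn_rate_threshold(consumed, period_hours, window_hours=1)
        print(f"{consumed:.0%} of budget in 1 hour -> burn rate {rate:.1f}")
```

A longer alerting window or a smaller budget fraction lowers the threshold, which makes the alert more sensitive to slow burns.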

Now, let’s understand how to calculate the Error budget burn rate with the help of an example.

SLO= 99.9%
Alerting window= 1 hour
Period= 30 days or 720 hours
Budget Consumed= 2% or 0.02

Error Budget Burn Rate = 0.02 * 720 / 1 = 14.4
Time to fire = (1 - 0.999) / 0.02 * 1 * 14.4 = 0.72

Here, time to fire = (1 - SLO) / error ratio * alerting window * burn rate, with an error ratio of 0.02, meaning 2% of requests are failing. This implies that the alert fires 0.72 hours (about 43 minutes) after the event has occurred.
Now, if we want the alert to fire exactly one hour after the event at the same error ratio, we calculate the burn rate as follows:

1 = (1 - 0.999) / 0.02 * 1 * burn rate
1 = 0.05 * burn rate
Burn Rate = 1 / 0.05
Burn Rate = 20
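
The rearrangement above can be checked with a short Python sketch. It assumes the time-to-fire relation (1 - SLO) / error ratio * alerting window * burn rate, where the error ratio is the fraction of failing requests. Both function names are hypothetical.

```python
def time_to_fire_hours(slo: float, error_ratio: float, window_hours: float, burn_rate: float) -> float:
    """(1 - SLO) / error ratio * alerting window * burn rate.

    Only meaningful when the error ratio is high enough to reach the
    burn rate threshold at all; otherwise the alert never fires.
    """
    return (1 - slo) / error_ratio * window_hours * burn_rate


def burn_rate_for_time_to_fire(slo: float, error_ratio: float, window_hours: float, target_hours: float) -> float:
    """The same relation solved for the burn rate."""
    return target_hours * error_ratio / ((1 - slo) * window_hours)


if __name__ == "__main__":
    # Values from the worked example: 99.9% SLO, 2% error ratio, 1-hour window.
    rate = burn_rate_for_time_to_fire(slo=0.999, error_ratio=0.02, window_hours=1, target_hours=1)
    print(f"Burn rate for a 1-hour time to fire: {rate:.1f}")
    print(f"Check: fires after {time_to_fire_hours(0.999, 0.02, 1, rate):.2f} hours")
```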

But suppose we allow the recommended 2% of the budget to be consumed before the alert fires. At an SLO of 99.9% over 30 days (720 hours), we have an error budget of 43.2 minutes, and out of this we would have lost 0.864 minutes (about 52 seconds) in just an hour.

Alternatively, we could have alerts fire at a consumption of 0.06 minutes of error budget in an hour, which is the pace of a burn rate of 1. That would be more sensitive and would keep us within the SLO, but then the question arises: is it feasible? It could also sharply increase the number of alerts, many of them for events that would not otherwise have needed any action.
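
To compare alerting thresholds in terms of downtime, a small Python sketch can convert a burn rate into error budget minutes consumed per alerting window. The function names are hypothetical, and the sketch assumes the budget is measured as minutes of full unavailability.

```python
def budget_minutes(slo: float, period_hours: float) -> float:
    """Total error budget for the SLO period, in minutes."""
    return (1 - slo) * period_hours * 60


def minutes_burned(slo: float, window_hours: float, burn_rate: float) -> float:
    """Budget minutes consumed during the window at a sustained burn rate."""
    return (1 - slo) * window_hours * 60 * burn_rate


if __name__ == "__main__":
    slo, period_hours = 0.999, 720  # 99.9% over 30 days
    total = budget_minutes(slo, period_hours)
    print(f"Total budget: {total:.1f} minutes")
    for rate in (1, 14.4):  # a burn rate of 1 versus the 2%-per-hour threshold
        burned = minutes_burned(slo, window_hours=1, burn_rate=rate)
        print(f"burn rate {rate}: {burned:.3f} minutes per hour ({burned / total:.2%} of budget)")
```

The lower the burn rate you alert on, the less budget is lost before the alert fires, but the more often short, harmless blips will trigger it.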

What do you guys think? Let us know in the comments section below.
