Understanding Reliability Risk
I started reading the “Google SRE” book during my time off, which began in November 2021. The book contains a lot of suggestions, guidance, and knowledge. I would like to understand and summarize the key ideas that come out of Google SRE, so that I can check my own decisions against them and see whether I am heading in the wrong direction.
This post covers my understanding of the second part of the book, Principles.
Previously, when I worked as a troubleshooter and as the leader of the Reliability Committee Group within my team, reliability usually meant the following to me:
- Making every outage trackable (e.g. root cause, short-term and long-term next actions)
- Making sure every feature in the weekly release is tested and can be rolled back in the prod env when something goes wrong
- Keeping our product (a KaaS platform) well maintained by setting up on-call shifts
- Syncing the reliability events of the last two weeks with team members in a meeting
Those impressions of reliability came from my troubleshooting and outage-handling experience. In other words, they came from the day-to-day work itself, not from thinking deeply about it afterwards. So when I worked as an SRE, I often knew which concrete work I should do, but I didn’t know, at a more abstract level, what reliability target that work should achieve. Is it extreme reliability of our platform, like 99.9999999% service reliability? Is it the excitement of locating the root cause of each outage or troubleshooting case? Those questions made me uncomfortable, because they meant I was working hard without knowing which real problem I should solve or which target I should reach.
After reading the “Embracing Risk” chapter and talking a lot with my ex-manager, I got some hints about reliability:
- Reliability engineering is not a specific product, but it is a necessary part of a product’s life cycle. In general, the target of reliability engineering should be to let the product serve internal and external users continuously at a reasonable level of quality.
- Business product development needs fast release velocity and an efficient delivery approach. Reliability engineering needs to make sure the product works well according to a predefined SLO and error budget. The two may conflict sometimes. However, if the reliability requirement becomes an obstacle to the growth of the product, it is meaningless. In short: no product, no reliability.
- My understanding of the general reliability target is: use technical and management measures to keep the product reliable, based on a balance between release velocity and reliability. This balance should be adjustable. It can be strict or loose depending on the actual situation, such as product requirements and available staff.
If we would like to “manage” risk, the first thing to do is find an effective way to measure it. We can think of the measured risk as the “actual state” in the Kubernetes world. Then we can work with the developers to figure out the “desired state”, which is the “risk threshold”. Finally, we should think about how to keep the service’s reliability at or above that threshold.
The risks to a program or service come from many factors, such as hardware failure or human operation error. It is hard to track every factor with an exact metric, so we can approach the problem from the top down instead. Every risk we have in mind is something that makes a service or program unreliable or unavailable. There are many ways to measure the availability of a service, such as measuring its downtime during a specific time window or measuring the ratio of failed requests on its API. We can call what these measurements capture “unplanned downtime”.
We usually use “nines” to indicate the availability level of a service, such as 99.99% or 99.999%. How it is calculated depends on the type of service.
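The two measurement styles mentioned above, and what a given number of nines tolerates, can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the 30-day window are assumptions made for this example, not something taken from the book or from a monitoring library.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability as the share of the window in which the service was up."""
    window = uptime_seconds + downtime_seconds
    if window <= 0:
        raise ValueError("the measurement window must be longer than zero")
    return uptime_seconds / window


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Availability as the share of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("at least one request is needed to compute a ratio")
    return successful_requests / total_requests


def allowed_downtime_seconds(target: float, window_days: int) -> float:
    """Downtime that an availability target tolerates within the window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * window_days * SECONDS_PER_DAY
```

For instance, `allowed_downtime_seconds(0.9999, 30)` comes out at roughly 259 seconds, a little over four minutes of unplanned downtime per 30 days, while five nines leave only about 26 seconds.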
Identify the “minimum desired state” of a service
To confirm the “minimum desired state” of a customer-facing service, we should first work through some questions with the product owner of the service:
- What type of service is it? For example, an API service or a data handling service, stateless or stateful, long-lived or short-lived.
- Which dependency services does the main service rely on?
- What is the severity of the service at company scope, such as SEV1, SEV2, or SEV3?
- How frequently is it released?
- Which factors may affect the availability or reliability of the service? For example, failed request rate, QPS, or resource capacity.
- Do those factors have the same impact on the service? For example, is a high response time for 5 minutes the same as being completely down for just 5 seconds?
- What is the minimum reliability requirement based on the factors above? For example, a 99.999% successful request rate over 1 month or half a year.
There are usually several solutions for keeping a service at least in its minimum desired state, each with different pros and cons. One of the most important considerations when choosing between them is the cost of the solution. The product owner and the SRE can ask two questions:
- Which resources, and how many of them, are needed to raise the reliability of the service to the desired state?
- Does the revenue of the service (both existing revenue and the new revenue gained after the reliability improvement) offset the cost of those resources?
The “cost” of an improvement sometimes makes the product owner rethink the “desired state” of the service. It may end up lower or higher than before. Nothing is free, so we have to make a trade-off here.
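The two cost questions above can be turned into a rough back-of-the-envelope check. The sketch below assumes, as a simplification, that revenue is proportional to the share of requests served successfully. The function names, the parameters, and all numbers are hypothetical and only meant to show the shape of the calculation.

```python
def value_of_availability_gain(
    current: float,
    target: float,
    revenue_per_period: float,
    expected_new_revenue: float = 0.0,
) -> float:
    """Estimate the revenue gained by raising availability from current to target.

    Assumes revenue scales linearly with availability (a simplification) and
    adds any new revenue the improvement is expected to attract.
    """
    if not 0.0 <= current <= target <= 1.0:
        raise ValueError("expected 0 <= current <= target <= 1")
    return (target - current) * revenue_per_period + expected_new_revenue


def improvement_pays_off(
    current: float,
    target: float,
    revenue_per_period: float,
    cost_per_period: float,
    expected_new_revenue: float = 0.0,
) -> bool:
    """True if the estimated gain is larger than the cost of the improvement."""
    gain = value_of_availability_gain(
        current, target, revenue_per_period, expected_new_revenue
    )
    return gain > cost_per_period
```

With a made-up revenue of 1,000,000 per period, moving from 99.9% to 99.99% is worth about 900 under this model, so an improvement that costs 5,000 per period would not pay off unless it also brings in new revenue.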
For example, say that as a cloud service provider I offer users an experimental env of a CI/CD SaaS service. They can do anything they like in there, but any issue in the exp env will only be fixed within 1 week. It’s slow because it’s free. That’s fair. In the prod env, however, we promise a 99.999% reliability level and provide 24/7 on-call support. That may need 2 engineers every week plus some development tasks. As long as the revenue of the prod env justifies it, this cost is acceptable. We could also choose a higher “desired state” for the exp env, but that would be unacceptable once you consider revenue and cost.
Once we have identified what a risk is, how to measure risk in services (the actual state), and what the minimum desired service level is (the desired state), we need a strategy to guide how we manage risk. The reason we say “manage” rather than “reduce” or “remove” is that we sometimes have to accept lower service reliability. For example, a service in its early phase needs plenty of room and time to make mistakes and learn from them so that it can move in the right direction. In that case, we shouldn’t set a strict service quality level that limits its release frequency.
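As SREs, we should care about the minimum desired state of a service and, through indicators, about the gap between it and the actual state of the service. We usually call the “minimum desired state” the SLO, and the gap the “error budget”. More precisely, the error budget is the amount of unreliability the SLO allows (100% minus the SLO), and the gap tells us how much of it is left.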
As long as the error budget has not been used up and is not close to 0, it is acceptable to take some risks in a new release. Otherwise, SRE will ask the product team to lower the release frequency or to improve test coverage and code quality, so that the SLO is not broken.
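The release policy described above can be sketched with a request-based SLO. This is a minimal sketch under stated assumptions: the `ErrorBudget` class, the `release_decision` function, and the 10% low watermark are all hypothetical names and values chosen for illustration, and a real policy would be agreed between SRE and the product team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests seen in the SLO window
    failed_requests: int  # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates in this window: (1 - SLO) * total."""
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent, clamped to the range 0..1."""
        allowed = self.allowed_failures
        if allowed <= 0:
            return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
        return max(0.0, 1.0 - self.failed_requests / allowed)


def release_decision(budget: ErrorBudget, low_watermark: float = 0.1) -> str:
    """Allow risky releases only while enough budget is left.

    low_watermark is an assumed threshold for "close to 0".
    """
    if budget.remaining_fraction <= low_watermark:
        return "slow down"  # fewer releases, more tests and code quality work
    return "release"
```

For example, with a 99.9% SLO and 1,000,000 requests in the window, about 1,000 failures are allowed. After 400 failures roughly 60% of the budget remains and releases can continue, while after 950 failures the sketch returns "slow down".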
Error Budget / SLO
The SLO is more like an anchor. It puts the product team and the SRE team on the same page about the “minimum service quality”. Breaking the SLO has bad consequences, such as losing money and users' trust.
The error budget is more like a tool. It lets SRE and product owners see how many mistakes new changes can make without affecting service quality. Once it is close to 0, the product team should stop new feature development and focus on reliability improvements, such as clearing tech debt and improving unit tests, to earn a new error budget.
When I worked in a private cloud team as a developer and SRE, our product was more like a “resource platform” than a PaaS. We built a platform that provided compute and storage resources to internal developers in a convenient way.
I wanted to find the root cause of every reliability issue on our platform, regardless of its impact and scale, because I thought my responsibility was to keep our platform running in a stable, reliable env. But I never asked myself whether that was what users really wanted.
I have to admit that this “extreme reliability” mindset often made me spend too much time evaluating a new feature of some piece of software, so internal developers who did not want to wait that long tried to set it up by themselves. Failing to consider the balance put me at odds with my users. In short, I sometimes solved fake issues that I had created myself. I have to correct this in my new journey.