The role of Site Reliability Engineer (SRE) is still gaining ground across organizations. In January 2022, LinkedIn ranked SRE 21st among the jobs with the highest global demand over the past five years. That’s remarkably high for such a specific tech role. Looking ahead, it appears that the SRE practice will only continue to be adopted as a method to support high availability, reliability, and better digital experiences for customers. The SRE approach is also essential for meeting service level agreements (SLAs) and internal service level objectives (SLOs).
I recently sat down with Kurt Anderson, SRE architect, and Emily Arnott, Blameless content writer, to get their opinions on the main trends and predictions for the SRE practice in 2022. According to the Blameless team, 2022 is likely to see wider adoption of the SRE role across companies and internal departments.
Just as manufacturing sets high standards for its quality assurance, digital applications now demand the same high reliability to meet inflated user expectations. To address these new realities, SREs have their work cut out for them: they must confront how to unify disparate DevOps tools and delve further to unearth specific user experience problems, all without increasing alert fatigue.
Increased adoption of SRE
As previously mentioned, the SRE role has seen tremendous growth in recent years and is showing no signs of slowing down. SRE as a practice was first popularized by Google in 2015. Today, Google simply defines it as “what happens when you ask a software engineer to solve an operational problem.” This amounts to maintaining high levels of reliability for digital services by creating error budgets, designing SLAs and SLOs, and optimizing process automation over time.
According to Arnott, SRE adoption is still taking off at an exponential rate as more and more companies embrace Google’s SRE doctrine. “During the pandemic, more and more companies were forced to adopt a digital strategy. Reliability has to be key to that strategy.” Arnott expects SRE to become a fruitful role in startups, which may have initially been intimidated by the prospect.
“The practice of SRE is a problem of scale,” Anderson explained. It can be difficult for time-strapped companies to put an SRE program in place when their attention is already so divided. However, as new requirements dictate warp speed innovation, we are seeing increased adoption of DevOps. And the SRE role, as a result, is becoming more and more imperative.
Most DevOps professionals would agree: We live in a multi-tool world. For example, multi-cluster management challenges are bringing a myriad of new tools into the fold. Computing environments are just as diverse, often combining multiple clouds and hybrid infrastructure that hosts various clusters of microservices. Even within the same organization, different teams can opt for different DevOps tools. Anderson believes SREs can bring “more unification to a fractured adoption of DevOps in a single company.”
As companies adopt more software, there is inevitably some degree of tool sprawl, Anderson said. For example, there may be competing methods for monitoring digital services. The platform team could use Datadog, one app team could use New Relic, and another team could use Prometheus. These tools do not speak the same language and generate different logs, creating silos.
For Arnott, the future involves accepting the innate complexity of the microservices era and using a tool-agnostic layer that can receive data from anywhere to identify significant incidents.
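To make the idea of a tool-agnostic layer concrete, here is a minimal Python sketch. It assumes each monitoring tool delivers alerts as a dictionary; the payload shapes, field names, and severity mappings below are hypothetical simplifications for illustration and are not the real Datadog, New Relic, or Prometheus schemas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """One alert, normalized so every tool 'speaks the same language'."""
    source: str
    service: str
    severity: str  # normalized to "info", "warning", or "critical"
    summary: str


# NOTE: the payload fields used below are hypothetical, not vendor schemas.
def from_datadog(payload: dict) -> Event:
    severity = {"low": "info", "normal": "warning", "high": "critical"}[payload["priority"]]
    return Event("datadog", payload["service"], severity, payload["title"])


def from_new_relic(payload: dict) -> Event:
    return Event("new_relic", payload["entity"], payload["level"].lower(), payload["description"])


def from_prometheus(payload: dict) -> Event:
    labels = payload["labels"]
    return Event("prometheus", labels["service"], labels["severity"], payload["summary"])


# One adapter per tool; adding a new tool means adding one entry here.
ADAPTERS: Dict[str, Callable[[dict], Event]] = {
    "datadog": from_datadog,
    "new_relic": from_new_relic,
    "prometheus": from_prometheus,
}


def significant_incidents(raw: List[Tuple[str, dict]]) -> List[Event]:
    """Normalize alerts from any known source and keep only critical ones."""
    events = [ADAPTERS[source](payload) for source, payload in raw]
    return [event for event in events if event.severity == "critical"]
```

The design choice worth noting is that the rest of the pipeline only ever sees the normalized Event type, so teams can keep their preferred tools while incident triage happens in one place.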
Observability gets a greater emphasis on UX
The SRE role is definitely a trend that has grown and will continue to grow in importance, Anderson said. And as more and more organizations go digital, it is becoming necessary; with this change comes tremendous pressure to exceed user experience expectations. “The competition and people’s expectations are so high right now,” Arnott said. “If you’re unreliable, that’s unacceptable these days.”
While DevOps focuses more on getting applications into production, SRE takes a more UX-centric approach. Take Netflix, for example, which has invested heavily in a core SRE team that ensures robust and reliable user-facing services. “A key element in satisfying our customers and streaming is a strong focus on reliability,” according to Netflix’s tech blog.
To meet this new paradigm, teams will require a deeper understanding of observability based on particular cohorts of users, Arnott said. SREs require precise user experience context, such as an increase in latency or an increase in error responses. But they also need to find out which factors contribute to these problems. “SLOs and error budgets are like measuring temperature, they tell you if something’s wrong. They don’t tell you why you’ve got a fever.”
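As a rough illustration of cohort-based observability, the Python sketch below breaks an error rate down by user cohort. The cohort names, the request format (a cohort label plus a success flag), and the 10% threshold are all assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Request = Tuple[str, bool]  # (cohort label, request succeeded?)


def error_rate_by_cohort(requests: Iterable[Request]) -> Dict[str, float]:
    """Return the fraction of failed requests for each cohort."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for cohort, succeeded in requests:
        totals[cohort] += 1
        if not succeeded:
            errors[cohort] += 1
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}


def unhealthy_cohorts(requests: Iterable[Request], threshold: float) -> List[str]:
    """List cohorts whose error rate exceeds the (assumed) threshold."""
    rates = error_rate_by_cohort(requests)
    return sorted(cohort for cohort, rate in rates.items() if rate > threshold)


# Hypothetical traffic: web users are fine, a small mobile cohort is not.
sample = (
    [("web", True)] * 98 + [("web", False)] * 2
    + [("mobile-eu", True)] * 6 + [("mobile-eu", False)] * 4
)
print(unhealthy_cohorts(sample, threshold=0.10))
```

In this made-up sample the overall error rate is about 5%, which could pass an aggregate check, while the smaller cohort is failing four requests in ten. That is the kind of context an aggregate “temperature” reading hides.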
SRE takes its cue from manufacturing
As reliability becomes more and more central to a company’s brand presence, it’s only natural for something this important to be elevated to the executive suite, Anderson explained. As a reference example, he pointed to the fact that manufacturing industries often appoint a reliability manager. In traditional manufacturing, product lines undergo statistical process control with rigorous testing to ensure quality. Similar positions exist across industries, such as chief quality officers in financial services.
The result is that as reliability increasingly becomes a competitive metric, technology companies may require a similar C-suite-level position to align with that idea and to direct teams to implement SRE practices. Of course, the need for such a title will depend entirely on the size and composition of the organization.
More reliance on SLOs to avoid breached SLAs
A service level agreement (SLA) is typically a formal contract built around the reliability of a digital service with an external partner, which, if breached, can result in financial penalties or other forms of restitution. This agreement may exist between a software vendor and an end user, but it is more strictly enforced in a B2B context.
A service level objective (SLO), on the other hand, is more of an internal goal. SLOs are often set at a stricter threshold than SLAs. For example, perhaps an SLA guarantees that user logins fail in no more than 10 of every million attempts, but the SLO is set at two in a million.
Setting internal SLOs helps engineers catch minor customer experience issues before they become serious incidents that result in fines. “Some companies view SLOs as just another warning; they should be representative of the user experience,” Anderson said. It makes sense that as SRE initiatives begin to mature, companies will slowly begin to adopt even more competitive goals to deliver superior user experiences and avoid dissatisfied partners.
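The arithmetic behind the login example can be sketched in a few lines of Python. The 10-in-a-million and two-in-a-million figures come from the example above; the traffic volume of five million attempts and all function and class names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTarget:
    """A reliability target expressed as allowed failures per million attempts."""
    name: str
    failures_per_million: int

    def allowed_failures(self, attempts: int) -> int:
        # Integer math avoids floating-point rounding surprises.
        return attempts * self.failures_per_million // 1_000_000


SLA = FailureTarget("SLA", 10)  # external contract from the example
SLO = FailureTarget("SLO", 2)   # stricter internal objective


def login_status(attempts: int, failures: int) -> str:
    """Classify observed login failures against the SLO and the SLA."""
    if failures > SLA.allowed_failures(attempts):
        return "SLA breached"
    if failures > SLO.allowed_failures(attempts):
        return "SLO missed, SLA intact"
    return "within SLO"


def slo_budget_remaining(attempts: int, failures: int) -> int:
    """Failures still tolerable before the internal SLO is missed."""
    return SLO.allowed_failures(attempts) - failures


# Assumed traffic: 5,000,000 login attempts in the measurement window.
# The SLO then allows 10 failures and the SLA allows 50.
print(login_status(5_000_000, 8))
print(login_status(5_000_000, 30))
print(slo_budget_remaining(5_000_000, 8))
```

With 30 failures the internal objective is already missed while the contract still holds, which is exactly the early-warning gap that a stricter SLO is meant to provide.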
Final consideration: don’t play the blame game
Central to SRE doctrine is the concept of the blameless postmortem. When you are in the middle of an incident, the worst thing to do is point the finger. “Blaming triggers a defense mechanism and when you’re in a defensive mode, you can’t respond well,” Anderson said.
Instead, it’s best to keep an open mind when doing these retrospectives: accept that what happened happened for a reason, and try to uncover that reason to make positive improvements. “What can we change on a systemic level to become more resilient?” Arnott asks. “What is the origin of that mistake? What information was missing? What communication should have taken place to prevent this?”
Typically, incidents do not occur intentionally or out of incompetence, but rather due to gaps in the process. By not playing the blame game, companies are in a better position to troubleshoot as they progressively release new features. So hopefully, with the right SRE approach in place, teams can advance their digital services to meet the high stakes of today’s user experience expectations.