Reliability, reliability, reliability.

Sep 08, 2020

Author: Joe Pocock, Techfellow's Head of Communications

Site Reliability Engineering (SRE) has been integral to technology-first companies since its conception at Google in 2003. It’s effectively a group of practices and processes that govern the reliability of infrastructure, driven by software and automation, minimising repetitive tasks and encouraging continuous innovation so as not to be left behind.

In a previous post we touched on automation within Ziglu, the cryptocurrency trading platform start-up. In this article we explore what SRE means to them.

Google have a strong idea about what SRE is – they’ve defined and documented it in their own bible, the Site Reliability Engineering book.

"What isn’t defined is DevOps – infrastructure and platforms engineering," says Matt Turner, Head of Platform at Ziglu.

“We call our team SRE because that’s our aspiration; not because we technically can’t be there, but because the model doesn’t necessarily work in a smaller organisation. For an SRE team to own services with certain acceptance criteria, meaning you can’t get something into production unless we’ve signed it off – I don’t think that works in a company of our size. If you reject one service based on some criteria, whatever that may be, there’s no queue of backup services behind it, so you simply can’t do anything. One service in a company of five people is vitally important; if you can’t run your code, a start-up could go under.

What SRE means to me is software engineering applied to infrastructure – taking engineers with those skills and attitudes and having them happen to focus on infrastructure rather than front or backend services. We take an automation-first, high-quality software engineering disciplined approach to everything we do.”

For us at Techfellow, SRE makes up a large chunk of our business; we’ve been fortunate enough to work with ex-Googlers and original SRE pioneers who are developing new SRE functions, now within finance and facing new regulatory and compliance challenges. It’s crucial that reliability and automation persist, otherwise teams run the risk of becoming purely operations-based.

As a relatively young firm, Ziglu are perfectly positioned to adhere to these principles (and those they lay down themselves) right from the get-go. Matt understands the importance of ensuring they remain engineers, rather than becoming reliant on operations.

Operating an SRE team as Google do isn’t necessarily the aspiration of every firm, established or new. The way reliability functions are defined can vary, which in turn means job titles can be somewhat fluid. The market can be flooded with engineers who think they’re doing SRE but aren’t quite doing it the Google way – or who often fall more into the DevOps category, relying much more on ready-made tools.

Does this present a larger problem for someone like Matt when he’s looking to hire engineers who align with his views and ideals? At the end of the day, he says, it all comes down to communication.

“I deliberately use SRE in our job specs because I wanted to signal that we’re using those ideas and attitudes – and I do explain in any interview that, even if we’re not following the book to the letter, I want those like-minded people.

I could have called us Cloud, Backend, and/or DevOps Engineers, but those terms are super ambiguous. When people come onto the market they equally need to play the same game. What do you call yourself? Which job have you applied for? A shared language would really help this kind of thing.”

Matt and his colleagues at Ziglu have made a conscious decision to be open and honest during the hiring process. As is often the case with SRE, they’ve seen people who are typically software engineers and want to move into infrastructure for some reason (possibly the money attached to the job), and they’ve found engineers who are more skilled with networking and servers but lack the modern coding skills required.

"Finding that intersection is very hard; SRE is my best attempt to signal that’s what we want, and those are the people who should engage with us."

We’ve seen the term evolve naturally over the years, especially with the rise of cloud computing, helped once again by Google, their platform, and the Kubernetes engine. Automation is a vast area of research and development outside of SRE – so how will it ultimately influence this function? Will SRE roles come to require a basic, or even advanced, ability to program machine learning algorithms?

“I don’t see why not – we use tools wherever they help us, right? Take the cloud for example, I used to rack and stack servers, and I could still, but there's no point when I can just click a button and get a Kubernetes cluster.

These things come in time though. If you think of a car, firstly there was no safety, then we added a seatbelt, traction control, and now sophisticated collision avoidance systems. A lot of this tech filters down from F1 into road cars, and in a very similar way a lot of our tech filters down from Google. It’s getting better all the time, but none of that is AI, it’s still traditional code, but I do definitely see the benefits of automation getting more sophisticated.”

It's difficult to predict exactly what the next generation of site reliability engineering will look like, but we can certainly gain a good idea by keeping a watchful eye on Google and what they continue to deliver. Matt is clearly passionate about these processes, and who can argue with aspiring to follow in Google's footsteps?