Site Reliability Engineer

United Kingdom, London, Remote
Permanent
Job ID: 1573

Job Description

Our client builds and maintains payment systems that are complex inside but simple outside, enabling their customers to worry less about payments. You’ll be one of the very first people to join this SRE function, which has been created by a former Google SRE leader. You’ll be given the opportunity to shape it and influence its culture: many decisions about exactly what SREs will do and how they’ll work with the rest of the company are still to be made.

Customers of this organisation depend on them for their transactions, in real time, 24/7. With such responsibility, extremely high availability and resiliency are essential.


Role Responsibilities

  • Design and implement monitoring and alerting for your services, using our monitoring platform. Educate others about monitoring and alerting best practices
  • Define the release strategy (e.g. canarying, release schedule) and implement it, using release automation tools
  • Define SLOs and track SLIs
  • Run post-mortems and fix the systems to prevent issues from recurring
  • Identify and mitigate production risks
  • Possibly participate in alert response as part of an on-call team; this is still to be decided
  • Define the reliability strategy and overall plan for your systems

You’ll never have to deal with office politics, nor will you be constantly firefighting with no hope of improving software that can be fixed. In the first 1-2 years, you’ll spend much more time on monitoring, release strategy, and automation than on changing the architecture. These are the areas that most need SRE attention at the moment, and they are still to be defined and implemented.

You’ll be a curious and open-minded individual who welcomes other people’s ideas, and a critical thinker who doesn’t blindly follow the book. Expect to deal with a lot of uncertainty, in a chaotic but always friendly environment.

To be successful in this role, you must be able to demonstrate prior responsibility for production systems over many years. You’ll bring experience in monitoring, alerting, and canarying. Most of the systems run on Kubernetes in the public cloud, so knowledge of this area is beneficial.
