Site Reliability Engineer
Our client builds and maintains payment systems that are complex on the inside but simple on the outside, enabling their customers to worry less about payments. You’ll be one of the very first people to join this SRE function, which has been created by a former Google SRE leader. You’ll be given the opportunity to shape it and influence its culture: a lot of decisions about exactly what SREs will do and how they’ll work with the rest of the company are still to be made.
Customers of this organisation depend on it for their transactions, in real time, 24/7. With such responsibility, extremely high availability and resiliency are essential.
- Design and implement monitoring and alerting for your services, using our monitoring platform. Educate others about monitoring and alerting best practices
- Define the release strategy (e.g. canarying, release schedule) and implement it, using release automation tools
- Define SLOs and track SLIs
- Carry out post-mortems and fix the systems to prevent issues from recurring
- Identify and mitigate production risks
- You might participate in alert response as part of an on-call team; this is still to be decided
- Define the reliability strategy and overall plan for your systems
You’ll never have to deal with office politics, nor will you be constantly firefighting, with no hope of improvement, on software that can be fixed. In the first 1-2 years, you’ll spend much more time on monitoring, release strategy, and automation than on changing the architecture. These are the areas that need the most attention from SRE at the moment: they still have to be defined and implemented.
You’ll be a curious and open-minded individual who welcomes other people’s ideas. You’ll be a critical thinker who doesn’t blindly follow the book. Expect to deal with a lot of uncertainty, in a chaotic but always friendly environment.
To be successful in this role, you must be able to demonstrate prior responsibility for production systems over many years. You’ll bring experience in monitoring, alerting, and canarying. Most of the systems run on Kubernetes in the public cloud, so knowledge of this area is beneficial.