Principal Site Reliability Engineer - Core Platform
Job Description
[c. $400-500k Comp Package | Hybrid Working]
Role Overview
We’re partnering with a leading multi-strategy investment firm as it continues to scale its platform engineering and reliability capability across global trading systems. This hire will operate as a senior individual contributor within the core SRE function, responsible for shaping reliability standards, driving platform improvements, and influencing engineering practices across the organisation. This is a high-impact role that sits at the intersection of platform engineering, production reliability, and distributed systems design. You will work closely with engineering leadership and senior developers to define reliability priorities, translate them into actionable plans, and deliver improvements that raise the operational maturity of the platform.
The role requires a balance of technical depth and influence - setting direction through hands-on delivery, guiding teams on reliability trade-offs, and embedding best practices across both cloud and on-prem environments. It is not a support-driven position; it is focused on building systems, standards, and tooling that improve how the organisation operates at scale...
Key Responsibilities
- Define and drive adoption of reliability standards across platform and application teams
- Partner with engineering leadership to shape reliability strategy and translate it into executable plans
- Establish and operationalise SLOs, SLIs, and error budgets, guiding teams on trade-offs between reliability, performance, and cost
- Design and evolve observability capabilities, improving visibility into system behaviour, latency, and failure modes
- Build and enhance monitoring ecosystems using tools such as Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
- Improve the reliability of Kubernetes-based production systems through best-practice configuration and capacity planning
- Develop automation and self-service tooling to improve deployment safety, recovery workflows, and operational efficiency
- Contribute to incident management practices, including leading by example during on-call rotations and driving meaningful post-incident improvements
- Collaborate with global platform and SRE teams to align on standards, patterns, and long-term platform direction
- Influence engineering teams to adopt scalable, reliable design patterns across the software lifecycle
What You’ll Bring…
- 7-11 years’ experience in Site Reliability Engineering, Production Engineering, or similar roles working on complex distributed systems
- Deep expertise in observability tooling including Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
- Strong experience operating and improving Kubernetes-based production environments
- Hands-on experience across both cloud platforms (AWS preferred) and on-premises infrastructure
- Strong scripting or programming capability in Python, Go, or Bash
- Solid understanding of CI/CD pipelines, DevOps practices, and modern software delivery models
- Experience defining and implementing SLOs, SLIs, and error budgets in production environments
- Proven ability to influence engineering teams and drive adoption of reliability practices
- Strong communication skills, with the ability to work effectively across technical and non-technical stakeholders
- A proactive, ownership-driven mindset with the ability to operate independently in high-impact environments