Site Reliability Engineer - Data Services Platform
Job Description
[Up to c. $300k Comp Package (or equivalent) | Hybrid Working]
Role Overview
We’re working with a global, multi-strategy investment firm that operates highly data-intensive trading and research systems across cloud and on-premise environments. As part of continued investment in platform resilience, the firm is building out a dedicated reliability capability within its Core Data Services function.
This role offers the opportunity to shape how reliability engineering is done, rather than inherit a rigid, fully mature SRE model. You’ll operate as a hands-on engineer while helping define standards, tooling, and ways of working that improve stability, observability, and operational maturity across business-critical data platforms.
The environment spans Kubernetes, distributed services, and mixed hosting models, with a strong emphasis on automation, telemetry, and pragmatic reliability engineering. You’ll work closely with platform, cloud, and application teams, embedding reliability thinking early in the lifecycle and helping ensure that systems remain performant, scalable, and resilient under real trading workloads.
Key Responsibilities
- Establish and evolve reliability engineering practices for core data platforms, influencing how services are designed, deployed, and operated
- Design and expand observability capabilities across services and infrastructure, building meaningful visibility into health, latency, and failure modes
- Define and review service reliability expectations, including availability targets, error budgets, and operational readiness within Kubernetes environments
- Develop automation and tooling to reduce manual intervention across deployment, monitoring, recovery, and operational workflows
- Participate in an on-call rotation (roughly one week per month), contributing to incident response and post-incident improvement efforts
- Partner with application and platform teams to improve fault tolerance, capacity planning, and resilience through SRE practices
- Lead or contribute to blameless post-incident reviews, translating operational issues into long-term engineering improvements
- Support reliability across both cloud-hosted and on-premise systems, balancing performance, cost, and operational simplicity
What You’ll Bring
- 4+ years' experience in site reliability engineering, platform engineering, or operating distributed production systems at scale
- Strong hands-on experience building or operating observability stacks, particularly across metrics, logs, and traces
- Deep working knowledge of Kubernetes and containerised workloads, including reliability considerations at cluster and application level
- Practical experience operating systems across both public cloud (AWS preferred) and on-premise infrastructure
- Experience supporting data platforms such as relational databases or caching systems
- Confidence writing automation in Python, Bash, or Go to improve reliability, diagnostics, or deployment workflows
- Solid understanding of CI/CD pipelines, DevOps principles, and modern software delivery practices
- A reliability-first mindset - comfortable balancing availability, performance, cost, and engineering effort
- (Preferred) Familiarity with messaging or streaming technologies used in data-heavy environments
- (Preferred) Exposure to workflow orchestration or scheduling platforms