Site Reliability Engineer - Data Services Platform

London (United Kingdom) | New York (United States) | Chicago, Illinois (United States)
Job ID: 2406

Job Description


[Up to c. $300k Comp Package (or equivalent) | Hybrid Working]


Role Overview

We’re working with a global, multi-strategy investment firm that operates highly data-intensive trading and research systems across cloud and on-premises environments. As part of continued investment in platform resilience, the firm is building out a dedicated reliability capability within its Core Data Services function.

This role offers the opportunity to shape how reliability engineering is done, rather than inherit a rigid, fully mature SRE model. You’ll operate as a hands-on engineer while helping define standards, tooling, and ways of working that improve stability, observability, and operational maturity across business-critical data platforms.

The environment spans Kubernetes, distributed services, and mixed hosting models, with a strong emphasis on automation, telemetry, and pragmatic reliability engineering. You’ll work closely with platform, cloud, and application teams, embedding reliability thinking early in the lifecycle and helping ensure that systems remain performant, scalable, and resilient under real trading workloads...


Key Responsibilities

  • Establish and evolve reliability engineering practices for core data platforms, influencing how services are designed, deployed, and operated
  • Design and expand observability capabilities across services and infrastructure, building meaningful visibility into health, latency, and failure modes
  • Define and review service reliability expectations, including availability targets, error budgets, and operational readiness within Kubernetes environments
  • Develop automation and tooling to reduce manual intervention across deployment, monitoring, recovery, and operational workflows
  • Participate in an on-call rotation (roughly one week per month), contributing to incident response and post-incident improvement efforts
  • Partner with application and platform teams to improve fault tolerance, capacity planning, and resilience through SRE practices
  • Lead or contribute to blameless post-incident reviews, translating operational issues into long-term engineering improvements
  • Support reliability across both cloud-hosted and on-premises systems, balancing performance, cost, and operational simplicity


What You’ll Bring…

  • 4+ years' experience in site reliability engineering, platform engineering, or operating distributed production systems at scale
  • Strong hands-on experience building or operating observability stacks, particularly across metrics, logs, and traces
  • Deep working knowledge of Kubernetes and containerised workloads, including reliability considerations at cluster and application level
  • Practical experience operating systems across both public cloud (AWS preferred) and on-premises infrastructure
  • Experience supporting data platforms such as relational databases or caching systems
  • Confidence writing automation in Python, Bash, or Go to improve reliability, diagnostics, or deployment workflows
  • Solid understanding of CI/CD pipelines, DevOps principles, and modern software delivery practices
  • A reliability-first mindset - comfortable balancing availability, performance, cost, and engineering effort
  • (Preferred) Familiarity with messaging or streaming technologies used in data-heavy environments
  • (Preferred) Exposure to workflow orchestration or scheduling platforms


...

