Name: Techfellow

Site Reliability Engineer - Data Services Platform

London, United Kingdom (Europe); New York, United States; Chicago, Illinois, United States

Job ID: 2406

Job Description

[Up to c. $300k Comp Package (or equivalent) | Hybrid Working]

Role Overview

We’re working with a global, multi-strategy investment firm that operates highly data-intensive trading and research systems across cloud and on-premise environments. As part of its continued investment in platform resilience, the firm is building out a dedicated reliability capability within its Core Data Services function. This role offers the opportunity to shape how reliability engineering is done, rather than inherit a rigid, fully mature SRE model. You’ll operate as a hands-on engineer while helping to define the standards, tooling, and ways of working that improve stability, observability, and operational maturity across business-critical data platforms. The environment spans Kubernetes, distributed services, and mixed hosting models, with a strong emphasis on automation, telemetry, and pragmatic reliability engineering. You’ll work closely with platform, cloud, and application teams - embedding reliability thinking early in the lifecycle and helping ensure that systems remain performant, scalable, and resilient under real trading workloads...

Key Responsibilities

Establish and evolve reliability engineering practices for core data platforms, influencing how services are designed, deployed, and operated
Design and expand observability capabilities across services and infrastructure, building meaningful visibility into health, latency, and failure modes
Define and review service reliability expectations, including availability targets, error budgets, and operational readiness within Kubernetes environments
Develop automation and tooling to reduce manual intervention across deployment, monitoring, recovery, and operational workflows
Participate in an on-call rotation (roughly one week per month), contributing to incident response and post-incident improvement efforts
Partner with application and platform teams to improve fault tolerance, capacity planning, and resilience through SRE practices
Lead or contribute to blameless post-incident reviews, translating operational issues into long-term engineering improvements
Support reliability across both cloud-hosted and on-premise systems, balancing performance, cost, and operational simplicity

What You’ll Bring…

4+ years' experience in site reliability engineering, platform engineering, or operating distributed production systems at scale
Strong hands-on experience building or operating observability stacks, particularly across metrics, logs, and traces
Deep working knowledge of Kubernetes and containerised workloads, including reliability considerations at cluster and application level
Practical experience operating systems across both public cloud (AWS preferred) and on-premise infrastructure
Experience supporting data platforms such as relational databases or caching systems
Confidence writing automation in Python, Bash, or Go to improve reliability, diagnostics, or deployment workflows
Solid understanding of CI/CD pipelines, DevOps principles, and modern software delivery practices
A reliability-first mindset - comfortable balancing availability, performance, cost, and engineering effort
(Preferred) Familiarity with messaging or streaming technologies used in data-heavy environments
(Preferred) Exposure to workflow orchestration or scheduling platforms

...
