SRE Lead
Job Description
[Up to c. $500k Comp Package | Hybrid Working - 3 Days in Office]
Role Overview
We’re representing a global multi-strategy investment firm seeking an SRE Lead to take ownership of reliability engineering across a business-critical technology estate. This role will lead a distributed team across New York and London, improving production stability, observability, operational discipline and reliability standards across demanding front-office and firmwide platforms.
This is a hands-on technical leadership role, not a purely managerial position. The team is experienced, but the next phase requires someone who can bring structure, cohesion and strategic direction - moving the function from a DevOps-leaning model towards a more mature SRE discipline. You’ll need the technical gravitas to command respect from senior engineers, while working constructively with demanding business stakeholders to deliver a high-quality service. Longer term, this is a strong progression opportunity for someone capable of growing into broader platform engineering leadership.
Key Responsibilities
- Bring structure to planning, prioritisation, delivery tracking and ownership across the team
- Establish consistent SRE standards across monitoring, incident response, operational readiness and service ownership
- Improve observability, alert quality, routing, metrics and performance visibility across the environment
- Move the team towards a more proactive reliability model, reducing repeat issues and reactive support
- Partner closely with business users, platform teams and engineering groups to improve service quality and resilience
- Lead improvements across Kubernetes operations, including reliability, upgrades, capacity, networking and workload stability
- Own reliability practices around critical distributed systems, including Kafka or similar messaging platforms
- Strengthen automation, CI/CD and GitOps practices using Terraform, Ansible, GitLab and ArgoCD
- Drive technical debt reduction and ensure recurring issues are addressed with durable fixes
- Participate in on-call as a senior escalation point for high-severity production incidents
- Track utilisation, cost and vendor performance across relevant SRE-owned services
What You’ll Bring…
- 8-15 years’ experience across SRE, production engineering, platform reliability or infrastructure engineering
- Proven experience leading senior engineers, either as a formal manager or technical lead
- Strong technical credibility, with the ability to operate at or above the level of an experienced SRE team
- Deep hands-on Kubernetes expertise across production operations, troubleshooting, upgrades, networking, RBAC, capacity and workload reliability
- Strong automation and Infrastructure-as-Code experience using Terraform, Ansible or similar
- Practical coding ability, ideally in Python, for tooling, automation and workflow improvement
- Strong observability background, including monitoring standards, alert quality and incident response processes
- Experience operating distributed systems, ideally Kafka or similar streaming/messaging platforms
- Familiarity with CI/CD and GitOps workflows, ideally with GitLab, ArgoCD or comparable tooling
- Experience across hybrid infrastructure environments, with AWS or similar public cloud exposure
- Strong Linux systems knowledge and broader infrastructure troubleshooting capability
- Opinionated technical judgement, balanced with the ability to bring others along constructively
- Service-oriented mindset, with the ability to support demanding business needs while improving long-term platform quality
- (Preferred) Experience with multi-region or multi-cluster reliability patterns, disaster recovery testing, or continuous service validation
- (Preferred) Background in financial services, trading, large-scale SaaS or other production-critical environments