SRE Lead

United States, New York
Permanent
Job ID: 2495

Job Description


[Up to c. $500k Comp Package | Hybrid Working - 3 Days in Office]


Role Overview

We’re representing a global multi-strategy investment firm seeking an SRE Lead to take ownership of reliability engineering across a business-critical technology estate. This role will lead a distributed team across New York and London, with responsibility for improving production stability, observability, operational discipline and reliability standards across a demanding front-office and firmwide platform environment.

This is a hands-on leadership role rather than a purely managerial position. The team is technically strong, but the next phase requires greater structure, clearer ownership, better prioritisation and more proactive reliability engineering. You’ll bring consistency to how incidents are reviewed, how monitoring is improved, how technical debt is reduced and how teams move from reactive support towards systematic resilience improvement...


Key Responsibilities

  • Lead and develop a globally distributed SRE team, providing direction, coaching and delivery discipline across New York and London
  • Define the operating model for SRE, including ownership boundaries, planning cadence, prioritisation methods and execution tracking
  • Establish reliability standards across critical services, ensuring teams adopt consistent patterns for monitoring, incident response and operational readiness
  • Improve alert quality, routing and ownership to reduce noise and ensure the right teams receive actionable signals
  • Drive stronger incident review practices, including lessons learned from outages, near-misses and recurring operational weaknesses
  • Partner proactively with engineering and platform teams to identify reliability gaps before they become production incidents
  • Create reusable tooling, templates and operational patterns that help engineering teams deploy and operate services more safely
  • Lead improvements across Kubernetes operations, including cluster reliability, upgrades, capacity planning, networking, access controls and workload resilience
  • Own reliability practices around key distributed systems, including critical messaging and streaming platforms such as Kafka
  • Strengthen CI/CD and GitOps enablement, working with modern delivery tooling such as GitLab and ArgoCD
  • Drive technical debt reduction across SRE-owned and SRE-supported services, ensuring fixes are durable rather than one-off
  • Participate in the on-call rotation as a senior escalation point for high-severity production issues
  • Track utilisation, cost and vendor performance across relevant tooling and platform services


What You’ll Bring…

  • 8-15 years’ experience across SRE, production engineering, platform reliability or infrastructure engineering, with clear leadership responsibility
  • Proven experience leading senior engineers, either as a formal manager or technical lead in a high-demand production environment
  • Deep hands-on Kubernetes expertise, including production operations, troubleshooting, upgrades, ingress, service discovery, storage, RBAC, capacity and workload reliability
  • Strong experience improving observability, monitoring standards, alert quality and incident response processes
  • Practical background operating critical distributed systems, ideally including Kafka or comparable messaging/streaming platforms
  • Strong infrastructure automation experience using tools such as Terraform, Ansible or similar
  • Familiarity with CI/CD and GitOps workflows, ideally involving GitLab, ArgoCD or comparable tooling
  • Experience working across hybrid infrastructure environments, with AWS or similar public cloud exposure
  • Strong Linux systems knowledge, with enough breadth to collaborate across mixed infrastructure estates
  • Ability to prioritise effectively under pressure and bring structure to teams managing both project work and operational demand
  • Strong communication skills, with the ability to influence engineers, platform teams and senior technology stakeholders
  • (Preferred) Experience designing multi-region or multi-cluster reliability patterns
  • (Preferred) Exposure to disaster recovery testing, service validation or continuous reliability testing
  • (Preferred) Background in financial services, trading, large-scale SaaS, infrastructure platforms or other production-critical environments


...


