Name: Techfellow

SRE Lead

United States, New York

Permanent

Job ID: 2495

Job Description

[Up to c. $500k Comp Package | Hybrid Working - 3 Days in Office]

Role Overview

We’re representing a global multi-strategy investment firm seeking an SRE Lead to take ownership of reliability engineering across a business-critical technology estate. This role will lead a distributed team across New York and London, with responsibility for improving production stability, observability, operational discipline and reliability standards across a demanding front-office and firmwide platform environment.

This is a hands-on leadership role rather than a purely managerial position. The team is technically strong, but the next phase requires greater structure, clearer ownership, better prioritisation and more proactive reliability engineering. You’ll bring consistency to how incidents are reviewed, how monitoring is improved, how technical debt is reduced and how teams move from reactive support towards systematic resilience improvement...

Key Responsibilities

Lead and develop a globally distributed SRE team, providing direction, coaching and delivery discipline across New York and London
Define the operating model for SRE, including ownership boundaries, planning cadence, prioritisation methods and execution tracking
Establish reliability standards across critical services, ensuring teams adopt consistent patterns for monitoring, incident response and operational readiness
Improve alert quality, routing and ownership to reduce noise and ensure the right teams receive actionable signals
Drive stronger incident review practices, including lessons learned from outages, near-misses and recurring operational weaknesses
Partner proactively with engineering and platform teams to identify reliability gaps before they become production incidents
Create reusable tooling, templates and operational patterns that help engineering teams deploy and operate services more safely
Lead improvements across Kubernetes operations, including cluster reliability, upgrades, capacity planning, networking, access controls and workload resilience
Own reliability practices around key distributed systems, including critical messaging and streaming platforms such as Kafka
Strengthen CI/CD and GitOps enablement, working with modern delivery tooling such as GitLab and ArgoCD
Drive technical debt reduction across SRE-owned and SRE-supported services, ensuring fixes are durable rather than one-off
Participate in the on-call rotation as a senior escalation point for high-severity production issues
Track utilisation, cost and vendor performance across relevant tooling and platform services

What You’ll Bring…

8-15 years’ experience across SRE, production engineering, platform reliability or infrastructure engineering, with clear leadership responsibility
Proven experience leading senior engineers, either as a formal manager or technical lead in a high-demand production environment
Deep hands-on Kubernetes expertise, including production operations, troubleshooting, upgrades, ingress, service discovery, storage, RBAC, capacity and workload reliability
Strong experience improving observability, monitoring standards, alert quality and incident response processes
Practical background operating critical distributed systems, ideally including Kafka or comparable messaging/streaming platforms
Strong infrastructure automation experience using tools such as Terraform, Ansible or similar
Familiarity with CI/CD and GitOps workflows, ideally involving GitLab, ArgoCD or comparable tooling
Experience working across hybrid infrastructure environments, with AWS or similar public cloud exposure
Strong Linux systems knowledge, with enough breadth to collaborate across mixed infrastructure estates
Ability to prioritise effectively under pressure and bring structure to teams managing both project work and operational demand
Strong communication skills, with the ability to influence engineers, platform teams and senior technology stakeholders
(Preferred) Experience designing multi-region or multi-cluster reliability patterns
(Preferred) Exposure to disaster recovery testing, service validation or continuous reliability testing
(Preferred) Background in financial services, trading, large-scale SaaS, infrastructure platforms or other production-critical environments

...
