Name: Techfellow

Principal Site Reliability Engineer - Core Platform

United States, New York, Illinois, Chicago

Permanent

Job ID: 2432

Job Description

[c. $400-500k Comp Package | Hybrid Working]

Role Overview

We’re partnering with a leading multi-strategy investment firm as it continues to scale its platform engineering and reliability capability across global trading systems. This hire will operate as a senior individual contributor within the core SRE function, responsible for shaping reliability standards, driving platform improvements, and influencing engineering practices across the organisation. This is a high-impact role that sits at the intersection of platform engineering, production reliability, and distributed systems design. You will work closely with engineering leadership and senior developers to define reliability priorities, translate them into actionable plans, and deliver improvements that raise the operational maturity of the platform.

The role requires a balance of technical depth and influence: setting direction through hands-on delivery, guiding teams on reliability trade-offs, and embedding best practices across both cloud and on-prem environments. It is not a support-driven position; it is focused on building systems, standards, and tooling that improve how the organisation operates at scale...

Key Responsibilities

Define and drive adoption of reliability standards across platform and application teams
Partner with engineering leadership to shape reliability strategy and translate it into executable plans
Establish and operationalise SLOs, SLIs, and error budgets, guiding teams on trade-offs between reliability, performance, and cost
Design and evolve observability capabilities, improving visibility into system behaviour, latency, and failure modes
Build and enhance monitoring ecosystems using tools such as Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
Improve the reliability of Kubernetes-based production systems through best-practice configuration and capacity planning
Develop automation and self-service tooling to improve deployment safety, recovery workflows, and operational efficiency
Contribute to incident management practices, including leading by example during on-call rotations and driving meaningful post-incident improvements
Collaborate with global platform and SRE teams to align on standards, patterns, and long-term platform direction
Influence engineering teams to adopt scalable, reliable design patterns across the software lifecycle

What You’ll Bring…

7-11 years’ experience in Site Reliability Engineering, Production Engineering, or similar roles within complex distributed systems
Deep expertise in observability tooling including Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
Strong experience operating and improving Kubernetes-based production environments
Hands-on experience across both cloud platforms (AWS preferred) and on-premises infrastructure
Strong scripting or programming capability in Python, Go, or Bash
Solid understanding of CI/CD pipelines, DevOps practices, and modern software delivery models
Experience defining and implementing SLOs, SLIs, and error budgets in production environments
Proven ability to influence engineering teams and drive adoption of reliability practices
Strong communication skills, with the ability to work effectively across technical and non-technical stakeholders
A proactive, ownership-driven mindset with the ability to operate independently in high-impact environments

...
