Observability Platform Engineer
Job Description
[Up to c. £250k Comp Package | Hybrid Working - 2-3 Days in Office]
Role Overview
We’re working with a research-driven technology organisation operating at the intersection of large-scale computing and quantitative problem solving. As their platforms continue to grow in scale and complexity, they are investing heavily in how telemetry is produced, transported, and consumed across the stack. This role sits at the core of that effort. You’ll be responsible for designing and operating the systems that allow engineers to understand how their services behave in production - at cloud scale and under constant load. Rather than focusing on individual applications, this position is about building shared observability infrastructure: standardised pipelines, instrumentation frameworks, and tooling that makes visibility reliable, predictable, and easy to adopt across hundreds of services...
Key Responsibilities
- Design and evolve the core telemetry ingestion layer that handles metrics, logs, and traces at very high volume
- Build and operate scalable data paths for observability signals, ensuring consistency, reliability, and predictable routing
- Extend and operate OpenTelemetry components, including collectors, exporters, and shared libraries used across multiple teams
- Develop opinionated instrumentation patterns (“golden paths”) to reduce friction for application and platform engineers
- Ensure Kubernetes-based services are observable by default, with clear standards for resilience and failure analysis
- Partner with platform, infrastructure, and application teams to embed observability expectations into service design
- Improve incident diagnosis and recovery by expanding coverage, signal quality, and correlation across telemetry sources
- Contribute practical industry experience to longer-term observability strategy and architectural direction
- Take part in a measured out-of-hours rota supporting critical telemetry services
What You’ll Bring...
- 5-9 years’ experience in observability, platform engineering, or SRE roles supporting large distributed systems
- Hands-on experience designing or running OpenTelemetry-based systems in production environments
- Strong understanding of cloud-native architectures and how telemetry behaves at scale
- Practical experience operating observability backends for metrics, logs, and traces
- Comfort working with Kubernetes ecosystems and production-grade container platforms
- Ability to build or extend tooling in Go, Python, or a comparable systems-oriented language
- Exposure to synthetic monitoring, event streaming, or telemetry pipelines built on Kafka
- Solid grasp of distributed systems concepts, including failure modes and performance bottlenecks
- A pragmatic engineering mindset - focused on usability, reliability, and long-term operability
- (Preferred) Experience with profiling or low-level visibility tooling such as eBPF-based systems
- (Preferred) Familiarity with emerging approaches such as AI-assisted observability or automated signal analysis
...