Observability Platform Engineer(s)

Europe, United Kingdom, London
Permanent
Job ID: 2390

Job Description


[Up to c. £300k Comp Package | Office-Led Working - 3 Days Remote Per Month]


Role Overview

We’re working with a research-driven quantitative technology firm on two remaining opportunities within their Observability platform team. The team is responsible for how telemetry is produced, transported, enriched and consumed across a highly complex, large-scale engineering environment - making Observability a core platform capability, not just a set of tools used after something breaks.

The two roles sit within the same team but carry slightly different emphasis. One is more focused on owning Observability across the full pipeline - OpenTelemetry, Prometheus, telemetry ingestion, backends, SaaS tooling and operational reliability. The other is more software/platform-led, focused on the producer side of Observability - building SDKs, libraries, collectors, integrations and shared engineering patterns that help teams emit high-quality metrics, logs and traces by default...


Key Responsibilities

  • Design, build and evolve Observability infrastructure across metrics, logs and traces, from telemetry production through to ingestion and backend consumption
  • Own and improve OpenTelemetry components, including SDKs, collectors, exporters, shared libraries and integrations used across engineering teams
  • Build reliable telemetry pipelines and data paths that improve consistency, routing, signal quality and long-term operability
  • Develop shared instrumentation patterns, APIs and “golden paths” that make it easier for teams to emit useful telemetry by default
  • Work with Prometheus-based systems, including writing and maintaining PromQL queries and improving metric quality
  • Support Observability platform deployments, migrations and integrations, including modern SaaS Observability tooling where relevant
  • Deploy and manage code and infrastructure using DevOps practices, including scripting, infrastructure as code and container-based delivery
  • Partner closely with software, platform and infrastructure teams to embed Observability expectations into service design
  • Improve incident diagnosis and recovery by expanding coverage, correlation, SLI/SLO thinking and failure analysis across telemetry sources
  • Contribute to future-looking work around streaming telemetry, event-based architectures, profiling, deeper signal collection and AI-assisted Observability
  • Take part in a measured on-call rota supporting critical Observability services as the team continues moving towards a stronger SRE model


What You’ll Bring...

Core experience across both roles:

  • 5-10 years’ experience across Observability, platform engineering, DevOps, SRE or software engineering roles in distributed production environments
  • Genuine Observability depth - not just experience using dashboards or monitoring tools at a surface level
  • Hands-on OpenTelemetry experience, ideally across SDKs, collectors, instrumentation, libraries, exporters or pipeline design
  • Strong understanding of metrics, logs and traces, including how telemetry is produced, transported, stored and consumed at scale
  • Kubernetes experience, including deploying workloads, working with Helm or understanding container-based application patterns
  • Comfort with DevOps practices, including infrastructure as code, deployment automation and operating production services
  • Exposure to SRE concepts such as SLIs, SLOs, error budgets, incident reduction and operational resilience
  • A pragmatic engineering mindset - focused on usability, reliability, adoption and long-term maintainability

For the Observability/platform-focused role:

  • Strong experience with Prometheus and PromQL, including practical use of Prometheus-based systems in production
  • Experience owning telemetry pipelines from producers through to ingestion, backend routing and ongoing platform management
  • Ability to deploy, operate or migrate Observability platforms, including modern SaaS Observability tools
  • Strong scripting ability, ideally with Python, alongside infrastructure tooling such as Terraform, Ansible or similar
  • Solid understanding of distributed systems, failure modes, performance bottlenecks and production reliability

For the software/platform-focused role:

  • Strong software engineering ability in C# and/or Python, with comfort working across both where needed
  • Experience building shared libraries, SDKs, APIs, collectors or integrations used by multiple engineering teams
  • Good understanding of software architecture and system design, beyond isolated coding tasks
  • Ability to work closely with application teams to improve telemetry quality and embed Observability patterns into services
  • Interest in shaping future tooling direction as the organisation continues to move further towards Python

Preferred experience:

  • Experience with Kafka, event streaming or telemetry pipeline tooling
  • Exposure to profiling, eBPF-based visibility tooling, synthetic monitoring or deeper runtime Observability
  • Familiarity with AI-assisted Observability, automated signal analysis or intelligent incident diagnosis


...

