Machine Learning Infrastructure Engineer

United States, New York
Permanent
Job ID: 2486

Job Description


[Up to c. $375k Comp Package | Office-Led Working]


Role Overview

We’re working with a leading global investment firm that is investing heavily in its AI capability, particularly across generative AI and advanced machine learning use cases. The position sits within a specialist engineering group responsible for building and evolving the platforms that enable large-scale model development and deployment. This is a highly technical role focused on enabling ML teams to operate effectively in production: you’ll work closely with researchers and engineers to ensure that training, experimentation, and inference run at scale - reliably, efficiently, and with clear visibility across both cloud and on-premise environments.


Key Responsibilities

  • Build and evolve the underlying infrastructure that supports compute-intensive ML and GenAI workloads
  • Develop systems that handle model training, evaluation, inference, and data preparation at scale
  • Work alongside ML practitioners to improve runtime efficiency, resource utilisation, and model responsiveness
  • Create and maintain deployment pipelines and environment provisioning using modern automation and orchestration approaches
  • Introduce robust monitoring and visibility across compute workloads, with a focus on performance and cost transparency
  • Assess emerging tools, platforms, and hardware options to enhance system capability and scalability
  • Improve system resilience through automation, better operational processes, and performance tuning
  • Diagnose and resolve bottlenecks across distributed compute environments
  • Contribute to internal standards, documentation, and platform best practices


What You’ll Bring…

  • 3-8 years’ experience in infrastructure engineering, platform engineering, or systems-focused roles supporting data or ML workloads
  • Strong grounding in distributed architecture and container-based environments (e.g. Kubernetes)
  • Experience working with major cloud providers (AWS, GCP, or Azure) in production settings
  • Familiarity with tooling used in ML ecosystems (such as orchestration frameworks, experiment tracking, or IaC solutions)
  • Solid programming ability in Python, plus exposure to a lower-level or performance-oriented language (e.g. Go, C++, Rust)
  • Experience supporting or optimising machine learning workloads in production environments
  • Strong troubleshooting skills, particularly around performance and scaling challenges
  • Exposure to monitoring and observability practices within high-throughput systems
  • (Preferred) Experience with advanced ML workloads such as reinforcement learning or large-scale experimentation platforms
  • (Preferred) Background in performance-critical environments (e.g. financial systems, large-scale platforms, or research compute)


...

