Machine Learning Infrastructure Engineer

United States, New York
Permanent
Job ID: 2486

Job Description


[Up to c. $375k Comp Package | Office-Led Working]


Role Overview

We’re working with a leading global investment firm that is investing heavily in its AI capability, particularly across generative AI and advanced machine learning use cases. The position sits within a specialist engineering group responsible for building and evolving the platforms that enable large-scale model development and deployment. This is a highly technical role focused on enabling ML teams to operate effectively in production: you’ll work closely with researchers and engineers to ensure that training, experimentation, and inference run at scale - reliably, efficiently, and with clear visibility across both cloud and on-premise environments.


Key Responsibilities

  • Build and evolve the underlying infrastructure that supports compute-intensive ML and GenAI workloads
  • Develop systems that handle model training, evaluation, inference, and data preparation at scale
  • Work alongside ML practitioners to improve runtime efficiency, resource utilisation, and model responsiveness
  • Create and maintain deployment pipelines and environment provisioning using modern automation and orchestration approaches
  • Introduce robust monitoring and visibility across compute workloads, with a focus on performance and cost transparency
  • Assess emerging tools, platforms, and hardware options to enhance system capability and scalability
  • Improve system resilience through automation, better operational processes, and performance tuning
  • Diagnose and resolve bottlenecks across distributed compute environments
  • Contribute to internal standards, documentation, and platform best practices


What You’ll Bring…

  • 3-8 years’ experience in infrastructure engineering, platform engineering, or systems-focused roles supporting data or ML workloads
  • Strong grounding in distributed architecture and container-based environments (e.g. Kubernetes)
  • Experience working with major cloud providers (AWS, GCP, or Azure) in production settings
  • Familiarity with tooling used in ML ecosystems (such as orchestration frameworks, experiment tracking, or IaC solutions)
  • Solid programming ability in Python, plus exposure to a lower-level or performance-oriented language (e.g. Go, C++, Rust)
  • Experience supporting or optimising machine learning workloads in production environments
  • Strong troubleshooting skills, particularly around performance and scaling challenges
  • Exposure to monitoring and observability practices within high-throughput systems
  • (Preferred) Experience with advanced ML workloads such as reinforcement learning or large-scale experimentation platforms
  • (Preferred) Background in performance-critical environments (e.g. financial systems, large-scale platforms, or research compute)


...

