Systems Engineer - HPC, GPU & Agentic AI Infrastructure
Job Description
[Up to c. $700k Comp Package | Hybrid Working]
Role Overview
We’re representing a world-leading computational research organisation operating at the intersection of supercomputing, machine learning, and scientific discovery, which is now expanding its systems engineering team in New York. The role spans large-scale Linux, HPC, GPU, storage, networking, Kubernetes, and cloud environments used by researchers and AI-driven systems. A key focus is maintaining on-premises compute platforms while designing secure cloud environments that isolate agentic workloads from sensitive internal data.
The organisation will consider candidates ranging from strong systems engineers through to senior and lead-level engineers. What matters most is deep Linux expertise, experience running infrastructure at scale, technical curiosity, and the ability to work across complex systems without being narrowly siloed.
Key Responsibilities
- Engineer and support large-scale Linux-based compute environments used for scientific, AI, and research workloads
- Help operate and improve on-premises HPC and GPU cluster infrastructure, including compute, storage, networking, and scheduling layers
- Design and maintain Kubernetes-backed environments for agentic AI workflows and distributed applications
- Contribute to secure cloud infrastructure patterns that allow AI agents and research tooling to run safely without unnecessary access to sensitive internal systems
- Support high-performance GPU platforms, large CPU clusters, and storage environments operating at petabyte scale
- Troubleshoot complex issues across Linux, networking, filesystems, distributed applications, and compute workloads
- Build automation and tooling to improve provisioning, reliability, observability, and user experience across infrastructure platforms
- Work closely with researchers, engineers, and security teams to make advanced compute resources accessible, secure, and reliable
- Contribute to architecture decisions around cloud, Kubernetes, HPC, networking, and workload isolation
- Continuously improve platform performance, scalability, and operational resilience as infrastructure demand increases
What You’ll Bring…
- 4-12 years’ experience in systems engineering, Linux infrastructure, HPC, cloud infrastructure, or large-scale platform environments
- Strong Linux fundamentals, including practical understanding of processes, networking, filesystems, permissions, performance, and troubleshooting
- Experience administering or engineering large Linux environments, ideally involving compute clusters or research infrastructure
- Experience with GPU clusters, HPC schedulers, RDMA networking, large-scale storage, or low-level systems performance
- Strong scripting or programming ability, ideally with Python, for automation and infrastructure tooling
- Hands-on exposure to Kubernetes, particularly for running distributed workloads or platform services
- Experience working with cloud infrastructure, especially where security, isolation, or scalable compute environments are important
- Understanding of high-performance or distributed systems, including compute, storage, networking, and workload orchestration
- Ability to diagnose unfamiliar technical problems across multiple layers of the stack
- Clear communication skills, with the ability to work effectively with researchers, engineers, infrastructure teams, and security stakeholders
- Strong intellectual curiosity and willingness to learn new systems, technologies, and scientific computing environments
- (Preferred) Exposure to secure workload isolation, agentic AI infrastructure, or sandboxed compute environments
- (Preferred) Experience acting as a technical lead or senior engineer within a complex infrastructure team
...
Apply for this role