Systems Engineer - HPC, GPU & Agentic AI Infrastructure
Job Description
[Up to c. $700k Comp Package | Hybrid Working]
Role Overview
We’re representing a world-leading computational research organisation operating at the intersection of supercomputing, machine learning, and scientific discovery, which is now expanding its systems engineering team in New York. The role spans large-scale Linux, HPC, GPU, storage, networking, Kubernetes, and cloud environments used by researchers and AI-driven systems. A key focus is maintaining on-premises compute platforms while designing secure cloud environments that isolate agentic workloads from sensitive internal data.
The organisation will consider candidates ranging from strong systems engineers through to senior and lead-level engineers. What matters most is deep Linux expertise, experience running infrastructure at scale, technical curiosity, and the ability to work across complex systems without being narrowly siloed.
Key Responsibilities
- Engineer and support large-scale Linux-based compute environments used for scientific, AI, and research workloads
- Help operate and improve on-premises HPC and GPU cluster infrastructure, including compute, storage, networking, and scheduling layers
- Design and maintain Kubernetes-backed environments for agentic AI workflows and distributed applications
- Contribute to secure cloud infrastructure patterns that allow AI agents and research tooling to run safely without unnecessary access to sensitive internal systems
- Support high-performance GPU platforms, large CPU clusters, and storage environments operating at petabyte scale
- Troubleshoot complex issues across Linux, networking, filesystems, distributed applications, and compute workloads
- Build automation and tooling to improve provisioning, reliability, observability, and user experience across infrastructure platforms
- Work closely with researchers, engineers, and security teams to make advanced compute resources accessible, secure, and reliable
- Contribute to architecture decisions around cloud, Kubernetes, HPC, networking, and workload isolation
- Continuously improve platform performance, scalability, and operational resilience as infrastructure demand increases
What You’ll Bring…
- 4-12 years’ experience in systems engineering, Linux infrastructure, HPC, cloud infrastructure, or large-scale platform environments
- Strong Linux fundamentals, including practical understanding of processes, networking, filesystems, permissions, performance, and troubleshooting
- Experience administering or engineering large Linux environments, ideally involving compute clusters or research infrastructure
- Experience with GPU clusters, HPC schedulers, RDMA networking, large-scale storage, or low-level systems performance
- Strong scripting or programming ability, ideally with Python, for automation and infrastructure tooling
- Hands-on exposure to Kubernetes, particularly for running distributed workloads or platform services
- Experience working with cloud infrastructure, especially where security, isolation, or scalable compute environments are important
- Understanding of high-performance or distributed systems, including compute, storage, networking, and workload orchestration
- Ability to diagnose unfamiliar technical problems across multiple layers of the stack
- Clear communication skills, with the ability to work effectively with researchers, engineers, infrastructure teams, and security stakeholders
- Strong intellectual curiosity and willingness to learn new systems, technologies, and scientific computing environments
- (Preferred) Exposure to secure workload isolation, agentic AI infrastructure, or sandboxed compute environments
- (Preferred) Experience acting as a technical lead or senior engineer within a complex infrastructure team
...
Apply for this role