Name: Techfellow

GPU Reliability Software Engineer

United States, New York

Permanent

Job ID: 2498

Job Description

[Up to c. $450k Comp Package | Hybrid Working]

Role Overview

We’re representing a leading quantitative trading firm operating one of the most sophisticated research and compute environments in the industry. As GPU usage continues to grow across research and trading workflows, the firm is hiring a GPU Reliability Software Engineer to improve how GPU infrastructure is monitored, managed, debugged and optimised at scale.

This is a software engineering role with a strong systems and infrastructure focus. You’ll build Python tooling to improve visibility across the GPU fleet, analyse workload behaviour, automate operational workflows, and help engineering teams get better performance and reliability from GPU-backed infrastructure. The role is well suited to someone who enjoys working close to hardware, Linux systems and large-scale automation, while still writing clean, maintainable software...

Key Responsibilities

Build Python-based tooling to improve GPU fleet management, monitoring, metrics collection and operational automation
Develop software features that streamline systems engineering workflows across provisioning, maintenance and infrastructure visibility
Investigate complex GPU-related issues spanning hardware, drivers, operating systems, applications, networking and kernel-level behaviour
Analyse GPU workload and job telemetry to identify inefficiencies, recurring issues and opportunities for optimisation
Partner with research, trading and infrastructure teams to improve how GPU resources are consumed across the business
Improve observability across GPU environments, helping teams understand utilisation, performance and reliability trends
Contribute to automation around network configuration, systems maintenance and infrastructure operations
Debug production issues quickly and methodically, balancing immediate resolution with longer-term platform improvements
Work across engineering teams to improve GPU efficiency, reduce operational friction and support continued infrastructure growth

What You’ll Bring…

3-8 years’ experience in software engineering, systems engineering, infrastructure engineering or GPU-focused reliability work
Strong Python development skills, with experience building automation, tooling or internal platforms
Hands-on experience managing, deploying, tuning or troubleshooting GPU hardware in production or research environments
Strong understanding of Linux / UNIX systems, including troubleshooting at OS and system level
Solid computer science fundamentals and awareness of software design patterns
Experience using automation to improve reliability, efficiency and repeatability across infrastructure workflows
Ability to debug issues across software, hardware and infrastructure layers
Familiarity with configuration management, monitoring or observability tooling
Exposure to CI/CD workflows and release automation
Familiarity with open-source software and modern engineering practices
Degree in Computer Science or a related technical discipline, or equivalent practical experience
(Preferred) Experience with Debian-based environments
(Preferred) Understanding of networking protocols and systems-level network troubleshooting

...

