GPU Reliability Software Engineer
Job Description
[Up to c. $450k Comp Package | Hybrid Working]
Role Overview
We’re representing a leading quantitative trading firm operating one of the most sophisticated research and compute environments in the industry. As GPU usage continues to grow across research and trading workflows, the firm is hiring a GPU Reliability Software Engineer to improve how GPU infrastructure is monitored, managed, debugged and optimised at scale.
This is a software engineering role with a strong systems and infrastructure focus. You’ll build Python tooling to improve visibility across the GPU fleet, analyse workload behaviour, automate operational workflows, and help engineering teams get better performance and reliability from GPU-backed infrastructure. The role is well suited to someone who enjoys working close to hardware, Linux systems and large-scale automation, while still writing clean, maintainable software...
Key Responsibilities
- Build Python-based tooling to improve GPU fleet management, monitoring, metrics collection and operational automation
- Develop software features that streamline systems engineering workflows across provisioning, maintenance and infrastructure visibility
- Investigate complex GPU-related issues spanning hardware, drivers, operating systems, applications, networking and kernel-level behaviour
- Analyse GPU workload and job telemetry to identify inefficiencies, recurring issues and opportunities for optimisation
- Partner with research, trading and infrastructure teams to improve how GPU resources are consumed across the business
- Improve observability across GPU environments, helping teams understand utilisation, performance and reliability trends
- Contribute to automation around network configuration, systems maintenance and infrastructure operations
- Debug production issues quickly and methodically, balancing immediate resolution with longer-term platform improvements
- Work across engineering teams to improve GPU efficiency, reduce operational friction and support continued infrastructure growth
What You’ll Bring…
- 3-8 years’ experience in software engineering, systems engineering, infrastructure engineering or GPU-focused reliability work
- Strong Python development skills, with experience building automation, tooling or internal platforms
- Hands-on experience managing, deploying, tuning or troubleshooting GPU hardware in production or research environments
- Strong understanding of Linux / UNIX systems, including troubleshooting at OS and system level
- Solid computer science fundamentals and awareness of software design patterns
- Experience using automation to improve reliability, efficiency and repeatability across infrastructure workflows
- Ability to debug issues across software, hardware and infrastructure layers
- Familiarity with configuration management, monitoring or observability tooling
- Exposure to CI/CD workflows and release automation
- Familiarity with open-source software and modern engineering practices
- Degree in Computer Science or a related technical discipline, or equivalent practical experience
- (Preferred) Experience with Debian-based environments
- (Preferred) Understanding of networking protocols and systems-level network troubleshooting