Added
14 minutes ago
Type
Full time
Related skills
pytorch deepspeed gpu infiniband fsdp
Description
- Drive AIR's GPU training platform architecture at scale.
- Tackle multi-node orchestration, parallelism, and scheduling.
- Improve GPU efficiency and training throughput.
- Build resilience and observability for long-running multi-node jobs.
- Collaborate with product/research/platform to shape APIs and tooling.
- Lead end-to-end engineering from design to production rollout.
Requirements
- 5+ years building large-scale distributed systems, including GPU training.
- Experience with distributed training frameworks such as PyTorch, FSDP, DeepSpeed, and Megatron.
- Strong background in resilience: checkpointing, failure detection, and automatic recovery.
- GPU performance fundamentals, including interconnects such as NVLink, InfiniBand, and RoCE.
- Experience building managed multi-tenant cloud platforms with SLAs/SLOs.
- Strong CS fundamentals and system design for performance-sensitive distributed systems.
Benefits
- Comprehensive benefits and perks for all employees.
- Region-specific benefits details via linked page.
- Diversity and inclusion commitment.
- Compliance and governance support.
- Mentorship and opportunities for learning.
- Global offices and culture of engineering excellence.