Added
8 hours ago
Type
Full time
Related skills
pytorch deepspeed infiniband fsdp nvlink
Description
- Drive AIR's GPU training platform architecture across thousands of accelerators.
- Tackle multi-node orchestration, distribution, GPU scheduling, and data loading.
- Improve throughput and resilience of production training jobs.
- Shape APIs and CLI; collaborate with product, research, and platform teams.
- Lead end-to-end engineering from design to production rollout.
- Mentor engineers and help set Databricks' AI training direction.
Requirements
- 10+ years building and operating large-scale distributed systems with GPU training infra.
- Hands-on with distributed frameworks: PyTorch, FSDP, DeepSpeed, Megatron; data/tensor/pipeline/sequence parallelism.
- Strong resilience patterns: checkpointing, failure detection, automatic recovery.
- GPU performance fundamentals: NVLink, InfiniBand/RoCE, bottlenecks.
- Experience building and operating managed, multi-tenant cloud platforms with SLAs/SLOs.
- BS in Computer Science or related field; MS/PhD preferred.
Benefits
- Comprehensive benefits and perks.
- Pay transparency information.
- Global offices and collaborative culture.
- Commitment to diversity and inclusion.