Added
1 day ago
Type
Full time
Related skills
pytorch deepspeed gpu fsdp nccl
Description
- Lead and grow a high-performing engineering team for Custom Training infra.
- Define and own the AIR product and technical roadmap.
- Collaborate with product, research, platform, infra, and customers to deliver end-to-end solutions.
- Drive architecture for managed GPU training at scale.
- Advocate for customer needs and translate them into product impact.
- Build observability and reliability practices for long-running multi-node jobs.
Requirements
- 8+ years of software engineering experience, with 3+ years in management.
- Track record building and operating GPU training infrastructure at scale.
- Deep familiarity with distributed training frameworks (PyTorch, including FSDP, and DeepSpeed).
- Experience with checkpointing, elastic training, and automated failure recovery.
- GPU performance fundamentals: NCCL, interconnects, memory optimization.
- BS/MS in Computer Science, Electrical Engineering, or related field.
Benefits
- Pay range transparency; regional benefits information available.
- Diversity and inclusion commitment across Databricks.
- Compliance and export controls information.
- Regional benefits vary; details at the benefits page.
- Databricks offices worldwide; collaborative, innovative culture.