Senior Engineering Manager, AI Runtime

Added: 1 day ago
Type: Full time

Related skills

pytorch deepspeed gpu fsdp nccl

📋 Description

  • Lead and grow a high-performing engineering team for Custom Training infra.
  • Define and own the AI Runtime (AIR) product and technical roadmap.
  • Collaborate with product, research, platform, and infra teams, and with customers, to deliver work end-to-end.
  • Drive architecture for managed GPU training at scale.
  • Advocate for customer needs and translate them into product impact.
  • Build observability and reliability practices for long-running multi-node jobs.

🎯 Requirements

  • 8+ years of software engineering experience, with 3+ years in management.
  • Track record building and operating GPU training infrastructure at scale.
  • Deep familiarity with distributed training (PyTorch, DeepSpeed) and FSDP.
  • Experience with checkpointing, elastic training, and automated failure recovery.
  • GPU performance fundamentals: NCCL, interconnects, memory optimization.
  • BS/MS in Computer Science, Electrical Engineering, or related field.

🎁 Benefits

  • Pay range transparency; regional benefits information available.
  • Diversity and inclusion commitment across Databricks.
  • Compliance and export controls information.
  • Regional benefits vary; details at the benefits page.
  • Databricks offices worldwide; collaborative, innovative culture.