Staff Software Engineer, AI Runtime

Added
8 hours ago
Type
Full time

Related skills

pytorch deepspeed infiniband fsdp nvlink

📋 Description

  • Drive AIR's GPU training platform architecture across thousands of accelerators.
  • Tackle multi-node orchestration, distribution, GPU scheduling, and data loading.
  • Improve throughput and resilience of production training jobs.
  • Shape APIs and CLI; collaborate with product, research, and platform teams.
  • Lead end-to-end engineering from design to production rollout.
  • Mentor engineers and help set Databricks' AI training direction.

🎯 Requirements

  • 10+ years building and operating large-scale distributed systems with GPU training infra.
  • Hands-on with distributed frameworks: PyTorch, FSDP, DeepSpeed, Megatron; data/tensor/pipeline/sequence parallelism.
  • Strong grasp of resilience patterns: checkpointing, failure detection, automatic recovery.
  • GPU performance fundamentals: NVLink, InfiniBand/RoCE, bottlenecks.
  • Experience building and operating managed, multi-tenant cloud platforms with SLAs/SLOs.
  • BS in Computer Science or related field; MS/PhD preferred.

🎁 Benefits

  • Comprehensive benefits and perks.
  • Pay transparency information.
  • Global offices and collaborative culture.
  • Commitment to diversity and inclusion.