Senior Software Engineer, AI Runtime

Added: 14 minutes ago
Type: Full time

Related skills

pytorch deepspeed gpu infiniband fsdp

📋 Description

  • Drive AIR's GPU training platform architecture at scale.
  • Tackle multi-node orchestration, parallelism, and scheduling.
  • Improve GPU efficiency and training throughput.
  • Build resilience and observability for long-running multi-node jobs.
  • Collaborate with product/research/platform to shape APIs and tooling.
  • Lead end-to-end engineering from design to production rollout.

🎯 Requirements

  • 5+ years building large-scale distributed systems, including GPU training.
  • Experience with distributed training frameworks such as PyTorch (FSDP), DeepSpeed, and Megatron.
  • Strong background in resilience: checkpointing, failure detection, and automatic recovery.
  • GPU performance fundamentals, including interconnects such as NVLink, InfiniBand, and RoCE.
  • Experience building managed multi-tenant cloud platforms with SLAs/SLOs.
  • Strong CS fundamentals and system design for performance-sensitive distributed systems.

🎁 Benefits

  • Comprehensive benefits and perks for all employees.
  • Region-specific benefit details are available on the linked page.
  • Diversity and inclusion commitment.
  • Compliance and governance support.
  • Mentorship and opportunities for learning.
  • Global offices and culture of engineering excellence.