AI Infrastructure Engineer - Training Platform

Added: 7 days ago
Type: Full time

Related skills

Terraform, S3, Kubernetes, Ray, InfiniBand

📋 Description

  • Architect a multi-tenant orchestration layer for GPU clusters that sustains high utilization.
  • Design and implement scheduling primitives to optimize training job lifecycles.
  • Develop observability and automated health checks to proactively identify hardware issues.
  • Evaluate and integrate CNCF and AI ecosystem technologies (Ray, Kueue) to inform data-driven build-vs-buy decisions.
  • Collaborate with Finance and Procurement to drive capacity planning.
  • Participate in on-call to ensure service availability.

🎯 Requirements

  • 5+ years in backend or infra engineering, with 2+ years on ML workloads at scale.
  • Strong programming skills in Python, Go, Rust, or C++.
  • Experience with compute management systems covering queueing, quotas, preemption, and gang scheduling.
  • Experience with distributed training infra (EFA, InfiniBand) and topology-aware scheduling.
  • Experience with distributed storage (Lustre, S3) and its effect on training throughput.
  • Kubernetes internals expertise (CRDs, Operators, Admission Controllers) and device plugins.

🎁 Benefits

  • Comprehensive health, dental and vision coverage.
  • Retirement benefits.
  • Learning and development stipend.
  • Generous PTO.
  • Commuter stipend (subject to eligibility).