Added
7 days ago
Location
Type
Full time
Related skills
terraform s3 kubernetes ray infiniband
Description
- Architect a multi-tenant orchestration layer for GPU clusters that sustains high utilization.
- Design and implement scheduling primitives to optimize training job lifecycles.
- Develop observability and automated health checks to proactively identify hardware issues.
- Evaluate and integrate CNCF and AI ecosystem technologies (e.g., Ray, Kueue) to inform data-driven build-vs-buy decisions.
- Collaborate with Finance and Procurement to drive capacity planning.
- Participate in on-call to ensure service availability.
Requirements
- 5+ years in backend or infra engineering, with 2+ years on ML workloads at scale.
- Strong programming skills in Python, Go, Rust, or C++.
- Experience with compute management systems covering queueing, quotas, preemption, and gang scheduling.
- Experience with distributed training infra (EFA, InfiniBand) and topology-aware scheduling.
- Experience with distributed storage (Lustre, S3) and its impact on training throughput.
- Kubernetes internals expertise (CRDs, Operators, Admission Controllers) and device plugins.
Benefits
- Comprehensive health, dental and vision coverage.
- Retirement benefits.
- Learning and development stipend.
- Generous PTO.
- Commuter stipend (subject to eligibility).