Added
4 days ago
Type
Full time
Related skills
python kubernetes go slurm nccl
Description
- Drive technical vision for Lambda's managed Kubernetes bare-metal platform
- Integrate NVIDIA GPU Operator, DCGM, NCCL, and topology tools
- Design GPU-aware orchestration and multi-tenant Kubernetes
- Lead development of services powering our managed platform
- Build foundation for Managed Slurm on Kubernetes
- Design platform services for inference, autoscaling, and multi-model deployment
Requirements
- 10+ years in software/platform engineering or SRE, with 5+ years on Kubernetes at scale
- Expert-level Kubernetes internals: API machinery, controllers, schedulers, CRDs, CSI, CNI
- Production-quality code in Go and Python
- GPU orchestration in Kubernetes: NVIDIA GPU Operator, DCGM, MIG
- Leadership: drive design decisions and mentor engineers
- Observability at scale: Prometheus, Grafana, tracing, alerting
Benefits
- Build core platform services for AI workloads
- NVIDIA partnership with cutting-edge tooling
- Tackle massive-scale GPU clusters
- Cross-stack influence across network, storage, compute
- Competitive compensation and equity
- Health, dental, vision; wellness stipend; 401(k) match; PTO