Principal Engineer, Cluster Orchestration

Type: Full time

Related skills

Kubernetes, Go, GPU, Slurm, SUNK

📋 Description

  • Lead architecture for cluster orchestration across Kubernetes, Slurm, SUNK, and Kueue.
  • Define long-term architecture and solve scaling problems across schedulers and control planes.
  • Balance performance, reliability, cost, and complexity in AI infrastructure.
  • Lead evolution of Kubernetes-native control planes and custom operators.
  • Design workload admission, validation, and rollout, including model onboarding flows.
  • Mentor senior and staff engineers, and influence platform, security, and product teams.

🎯 Requirements

  • 15+ years building and operating large-scale distributed systems.
  • Deep knowledge of Kubernetes and Slurm internals.
  • Experience running GPU-heavy AI training, inference, or HPC workloads.
  • Strong Go and cloud-native systems development background.
  • Proven ability to set technical direction across teams without direct authority.
  • Bachelor’s or Master’s degree in a relevant field, or equivalent experience.

🎁 Benefits

  • Medical, dental, and vision insurance — 100% paid.
  • Company-paid life insurance.
  • Voluntary supplemental life insurance.
  • Short-term and long-term disability insurance.
  • Flexible Spending Account.
  • Health Savings Account.