Site Reliability Engineer - AI & ML Infrastructure (Kubernetes & Terraform)

Added: 1 minute ago
Type: Full time

Related skills

terraform aws python kubernetes ci/cd

📋 Description

  • Architect and maintain a Kubernetes-based platform on AWS and on-prem.
  • Develop and manage infrastructure with Terraform (IaC).
  • Design and optimize AI/ML job scheduling using Slurm on Kubernetes.
  • Provision, manage, and maintain on-prem bare-metal GPU infrastructure.
  • Implement networking (CNI, service mesh) and storage (CSI, S3) for hybrid workloads.
  • Build an observability stack and automate operational tasks and incident response.

🎯 Requirements

  • 5+ years in Platform Engineering, DevOps, or SRE.
  • Proven Terraform experience in production infrastructure.
  • Expert-level Kubernetes architecture and operations in large-scale environments.
  • Experience with HPC schedulers, especially Slurm, for GPU AI workloads.
  • Experience managing bare-metal infrastructure (PXE, MAAS).
  • Strong scripting and automation (Python, Go, Bash).

๐ŸŽ Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) insurance
  • Unlimited PTO
  • 401(k) plan with company match