Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

Added: 2 hours ago
Type: Full-time

Related skills

Terraform, AWS, Kubernetes, CI/CD, GPU

📋 Description

  • Architect and maintain Kubernetes clusters on AWS and on-premises.
  • Build and manage IaC with Terraform for reproducible environments.
  • Design and optimize AI/ML job scheduling with Slurm on Kubernetes.
  • Provision and manage on-prem bare metal servers for GPU computing.
  • Implement networking (CNI/service mesh) and storage (CSI/S3) for hybrid workloads.
  • Develop observability and automation for operations and incidents.

🎯 Requirements

  • 5+ years in Platform Engineering, DevOps, or SRE
  • Hands-on Terraform experience in production
  • Expert knowledge of Kubernetes in large-scale environments
  • Experience with Slurm HPC scheduler for GPU workloads
  • Experience provisioning bare metal servers (PXE, MAAS) and managing their lifecycle
  • Strong scripting skills (Python/Go/Bash)

๐ŸŽ Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Unlimited PTO
  • Generous paid parental leave
  • 401(k) plan with company match