Site Reliability Engineer - AI & ML Infrastructure (Kubernetes & Terraform)
Related skills
terraform aws python kubernetes ci/cd
Description
- Architect and maintain a Kubernetes-based platform on AWS and on-prem.
- Develop and manage infrastructure with Terraform (IaC).
- Design and optimize AI/ML job scheduling using Slurm on Kubernetes.
- Provision, manage, and maintain on-prem bare-metal GPU infrastructure.
- Implement networking (CNI, service mesh) and storage (CSI, S3) for hybrid workloads.
- Build an observability stack and automate operational tasks and incident response.
Requirements
- 5+ years in Platform Engineering, DevOps, or SRE.
- Proven Terraform experience in production infrastructure.
- Expert Kubernetes architecture and operations in large-scale environments.
- Experience with HPC schedulers, especially Slurm, for GPU AI workloads.
- Experience managing bare-metal infrastructure (PXE, MAAS).
- Strong scripting and automation (Python, Go, Bash).
Benefits
- Medical, dental, vision benefits
- Annual wellness stipend
- Mental health support
- Life, STD, LTD income insurance
- Unlimited PTO
- 401(k) plan with company match