Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)
Related skills
terraform aws kubernetes ci/cd gpu
Description
- Architect and maintain Kubernetes clusters on AWS and on-premises.
- Build and manage IaC with Terraform for reproducible environments.
- Design and optimize AI/ML job scheduling with Slurm on Kubernetes.
- Provision and manage on-prem bare metal servers for GPU computing.
- Implement networking (CNI/service mesh) and storage (CSI/S3) for hybrid workloads.
- Develop observability and automation for operations and incidents.
Requirements
- 5+ years in Platform Engineering, DevOps, or SRE
- Hands-on Terraform experience in production
- Expert knowledge of Kubernetes in large-scale environments
- Experience with Slurm HPC scheduler for GPU workloads
- Experience provisioning bare metal servers (PXE, MAAS) and managing their lifecycle
- Strong scripting skills (Python/Go/Bash)
Benefits
- Medical, dental, vision benefits
- Annual wellness stipend
- Mental health support
- Unlimited PTO
- Generous paid parental leave
- 401(k) plan with company match