Type: Full time
Related skills: Azure, Terraform, AWS, Grafana, Prometheus
Description
- Architect, deploy, and operate large-scale LLM inference servers with low latency (a latency-probe sketch follows this list).
- Design cloud architectures across Azure and AWS.
- Manage GPU-enabled Kubernetes clusters and Slurm HPC for AI workloads.
- Deploy Kubernetes components and operators (GPU, ingress, CNIs, CSIs).
- Build IaC and deployment workflows with Terraform, Helm, Kustomize, ArgoCD.
- Design and maintain centralized observability with Prometheus, Grafana, and ClickHouse.
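
To make the low-latency inference duty concrete, here is a minimal sketch of a latency probe against an OpenAI-compatible inference server such as vLLM's. The endpoint URL and model name are assumptions for illustration, not details from this posting; adjust both to the actual deployment.

```python
"""Minimal latency probe for an OpenAI-compatible LLM inference endpoint.

Assumptions (not from the posting): the server is vLLM's OpenAI-compatible
API at http://localhost:8000 and serves a model named "my-model".
"""
import statistics
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed server URL
MODEL = "my-model"                                  # assumed model name


def probe_latency(prompt: str, runs: int = 5) -> list[float]:
    """Send a few short completion requests and record wall-clock latency."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(
            ENDPOINT,
            json={"model": MODEL, "prompt": prompt, "max_tokens": 16},
            timeout=30,
        )
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    samples = probe_latency("Hello, world")
    print(f"p50 latency: {statistics.median(samples):.3f}s")
    print(f"max latency: {max(samples):.3f}s")
```

In practice a probe like this would feed a Prometheus exporter or dashboard rather than print to stdout, but the request/response timing loop is the same.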
Requirements
- 5+ years in DevOps, SRE, or ML infra for production systems.
- Azure and AWS: storage, compute, networking, and databases.
- Kubernetes administration with GPU scheduling (see the GPU-capacity sketch after this list); Slurm experience desirable.
- Deploy/scale/operate LLMs and inference engines (vLLM, TGI, Triton).
- DevOps tooling: Terraform, Helm, Kustomize, ArgoCD; CI (GitHub/GitLab).
- Python and Bash scripting; debugging distributed systems at scale.
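
As an illustration of the GPU-scheduling and Python-scripting requirements, here is a small sketch that tallies allocatable GPUs per node. It assumes `kubectl` is configured against the target cluster and that GPU nodes expose the standard `nvidia.com/gpu` allocatable resource published by the NVIDIA device plugin.

```python
"""Quick check of GPU capacity on a Kubernetes cluster (sketch).

Assumes kubectl is configured for the target cluster and that GPU nodes
advertise the "nvidia.com/gpu" allocatable resource.
"""
import json
import subprocess


def gpu_allocatable_per_node() -> dict[str, int]:
    """Return {node_name: allocatable GPU count} from `kubectl get nodes`."""
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(raw)["items"]
    return {
        node["metadata"]["name"]: int(
            node["status"]["allocatable"].get("nvidia.com/gpu", "0")
        )
        for node in nodes
    }


if __name__ == "__main__":
    for name, gpus in gpu_allocatable_per_node().items():
        print(f"{name}: {gpus} GPU(s) allocatable")
```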
Benefits
- Diverse medical, dental and vision options
- 401k matching program
- Unlimited paid time off
- Parental leave and flexibility for all parents and caregivers
- Support of country-specific visa needs for international employees living in the Bay Area