Related skills
terraform, python, kubernetes, go, cdk

📋 Description
- Design and operate GPU infrastructure for model hosting and scheduling.
- Build and scale model serving with vLLM, TensorRT-LLM, and Triton for real-time inference.
- Implement multi-model routing across modalities on shared infrastructure.
- Own end-to-end model lifecycle: download, deploy, serve, monitor, scale.
- Drive inference optimization: quantization, batching, caching, cold-start reduction.
- Build self-service platforms to provision compute, storage, and model endpoints via APIs.
🎯 Requirements
- 8+ years software engineering; 3+ years building infra platforms or ML/AI infra.
- Deep experience with AWS, GCP, and Kubernetes.
- Hands-on with GPU workloads and model serving (vLLM, TensorRT-LLM, Triton).
- Proficiency in Python, Go, or C++.
- Infrastructure-as-code (IaC) experience: Terraform, Pulumi, CDK.
- Experience leading cross-team technical initiatives and influencing direction.