Related skills
gpu · vllm · tensorrt-llm · tgi

📋 Description
- Lead the Model Routing & Inference team, owning the inference platform powering AI interactions.
- Own the full inference path: latency, reliability, and cost optimization at scale.
- Set technical direction for cluster management, inference optimization, and traffic egress.
- Build a platform that lets product teams ship quickly without being exposed to provider complexity.
- Lead engineers, set direction, and balance latency, cost, reliability, and UX.
- Drive projects: inference gateway, model selection, GPU utilization, and routing control.
🎯 Requirements
- Experience leading teams that build high-throughput, low-latency distributed systems (inference, routing).
- Ability to reason about cost/performance tradeoffs at scale (GPU economics, capacity planning) with incomplete information.
- Strong software fundamentals; shipped production systems handling millions of requests.
- Experience with model serving frameworks (vLLM, TensorRT-LLM, TGI) and load balancing.
- Experience hiring and growing teams; coaching and mentorship.
- Ability to make calls balancing reliability, cost, latency, and UX in ambiguous situations.
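The cost/latency routing tradeoff the role centers on can be illustrated with a small sketch. This is not the company's system; the `Backend` fields, the latency budget, and the scoring blend are all illustrative assumptions about how a model-selection router might pick among serving backends:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    """Hypothetical serving backend with observed cost and latency."""
    name: str
    cost_per_1k_tokens: float  # dollars, assumed pricing
    p50_latency_ms: float      # assumed median latency

def route(backends, latency_budget_ms, cost_weight=0.5):
    """Pick a backend under a latency budget, blending cost and latency.

    Backends over budget are excluded; among the rest, the one with the
    lowest weighted score (normalized cost vs. normalized latency) wins.
    """
    eligible = [b for b in backends if b.p50_latency_ms <= latency_budget_ms]
    if not eligible:
        # No backend meets the budget: fall back to the fastest one.
        return min(backends, key=lambda b: b.p50_latency_ms)
    max_cost = max(b.cost_per_1k_tokens for b in eligible)
    max_lat = max(b.p50_latency_ms for b in eligible)
    def score(b):
        return (cost_weight * b.cost_per_1k_tokens / max_cost
                + (1 - cost_weight) * b.p50_latency_ms / max_lat)
    return min(eligible, key=score)

backends = [
    Backend("large-model", cost_per_1k_tokens=0.03, p50_latency_ms=900),
    Backend("small-model", cost_per_1k_tokens=0.002, p50_latency_ms=200),
]
print(route(backends, latency_budget_ms=500).name)  # small-model
```

In practice a production router would use live latency and error signals rather than static medians, but the shape of the decision (filter by SLO, then score on cost) is the same.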