Related skills
distributed systems, monitoring, infrastructure, observability, GPU

Description
- Develop SLOs for large language model serving and training systems.
- Design and implement monitoring for availability and latency metrics.
- Design and implement highly available LLM serving infra for millions of users.
- Develop automated failover and recovery for model serving across regions and clouds.
- Lead incident response for critical AI services with rapid recovery.
- Build cost optimization for large-scale AI infra, focusing on accelerator utilization.
Requirements
- Extensive experience with distributed systems observability at scale.
- Experience operating AI infrastructure: model serving, batch inference, training.
- Proven experience implementing SLO/SLA frameworks for critical services.
- Experience with traditional metrics (latency, availability) and AI metrics (performance/convergence).
- Experience with chaos engineering and resilience testing.
- Ability to bridge ML engineers and infrastructure teams effectively.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely London office space.
Visa sponsorship