Senior Software Engineer, AI Reliability Engineering

Added: 29 days ago
Type: Full time

Related skills

distributed systems, monitoring, infrastructure, observability, GPU

📋 Description

  • Develop SLOs for large language model serving and training systems.
  • Design and implement monitoring for availability, latency, and other service metrics.
  • Design and implement highly available LLM serving infra for millions of users.
  • Develop automated failover and recovery for model serving across regions and clouds.
  • Lead incident response for critical AI services with rapid recovery.
  • Build cost optimization for large-scale AI infra, focusing on accelerator utilization.

🎯 Requirements

  • Extensive experience with distributed systems observability at scale.
  • Experience operating AI infrastructure: model serving, batch inference, training.
  • Proven experience implementing SLO/SLA frameworks for critical services.
  • Experience with traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence).
  • Experience with chaos engineering and resilience testing.
  • Ability to bridge ML engineering and infrastructure teams effectively.

🎁 Benefits

  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.
  • Lovely London office space.

🛃 Visa sponsorship
