Senior Software Engineer, AI Reliability Engineering

Added: 26 days ago
Type: Full time

Related skills

distributed systems, monitoring, infrastructure, observability, GPU

📋 Description

  • Develop SLOs for LLM serving and training
  • Design and implement monitoring for availability and latency
  • Build high-availability model serving infra for millions of users
  • Create automated failover and recovery across regions and clouds
  • Lead incident response for critical AI services
  • Optimize costs, with a focus on GPU/TPU/Trainium utilization

🎯 Requirements

  • Extensive experience with distributed systems observability and monitoring at scale
  • Experience operating AI infrastructure, including model serving, batch inference, and training
  • Proven track record implementing and maintaining SLO/SLA frameworks
  • Comfort with traditional metrics (latency, availability) and AI metrics
  • Experience with chaos engineering and resilience testing
  • Ability to bridge the gap between ML engineers and infrastructure teams

🎁 Benefits

  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours
  • Dublin office with a collaborative workspace
  • Global, mission-driven culture

🛃 Visa sponsorship
