Related skills
distributed systems monitoring infrastructure observability gpu
Description
- Develop SLOs for LLM serving and training
- Design and implement monitoring for availability and latency
- Build high-availability model serving infrastructure for millions of users
- Create automated failover and recovery across regions and clouds
- Lead incident response for critical AI services
- Optimize costs, focusing on GPU/TPU/Trainium utilization
Requirements
- Extensive experience with distributed systems observability and monitoring at scale
- Experience operating AI infrastructure, including model serving, batch inference, and training
- Proven track record implementing and maintaining SLO/SLA frameworks
- Comfort with both traditional metrics (latency, availability) and AI-specific metrics
- Experience with chaos engineering and resilience testing
- Ability to bridge the gap between ML engineers and infrastructure teams
Benefits
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours
- Dublin office with a collaborative workspace
- Global, mission-driven culture
Visa sponsorship