Related skills
distributed systems, incident response, monitoring, observability, reliability engineering
Description
- Develop service level objectives for large language model serving systems, balancing availability and latency with development velocity.
- Design and implement monitoring and observability systems across the token path.
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
- Support the reliability of safeguard model serving as part of Anthropic's safety commitments.
Requirements
- Strong distributed systems, infrastructure, or reliability background (SRE/engineer).
- Comfortable jumping into unfamiliar systems during incidents and driving resolution.
- Think holistically about how systems compose and where the seams are.
- Excellent communication and collaboration skills for partnering with teams across the company.
- Diverse experience across product stacks, scaling databases, and large distributed systems.
- Experience with AI model serving and observability tools is a plus.
Benefits
- Competitive benefits package.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Office space in San Francisco, New York, or Seattle.
Visa sponsorship