Software Engineer, AI Reliability

Added
25 days ago
Type
Full time

Related skills

distributed systems, incident response, monitoring, observability, reliability engineering

📋 Description

  • Develop service level objectives for large language model serving systems, balancing availability and latency with development velocity.
  • Design and implement monitoring and observability systems across the token path.
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  • Support the reliability of safeguard model serving as part of Anthropic's safety commitments.

🎯 Requirements

  • Strong distributed systems, infrastructure, or reliability background (SRE/engineer).
  • Comfortable jumping into unfamiliar systems during incidents and driving resolution.
  • Ability to think holistically about how systems compose and where the seams are.
  • Excellent communication and collaboration skills for partnering with teams across the company.
  • Diverse experience across product stacks, scaling databases, and large distributed systems.
  • Experience with AI model serving and observability tools is a plus.

🎁 Benefits

  • Competitive benefits package.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.
  • Office space in San Francisco, New York, or Seattle.

🛃 Visa sponsorship
