Added
less than a minute ago
Type
Full time
Salary
Salary not provided

Related skills

Datadog, AWS, LLM, ASR, TTS

📋 Description

  • Own incident lifecycle from detection to postmortems.
  • Act as central command during major incidents, coordinating the response and status updates.
  • Define and enforce SLAs/SLOs, severity frameworks, and runbooks.
  • Collaborate with Eng, ML, and Integrations teams to resolve issues.
  • Monitor system health across integrations (LLMs, ASR/TTS).
  • Drive RCA and preventive actions.
  • Improve observability, alerting, and tooling.
  • Maintain clear internal and customer-facing communication during incidents.

🎯 Requirements

  • 3–6 years in Incident Management / SRE / Production Support.
  • Strong understanding of distributed systems, APIs, and AWS.
  • Experience with observability tools such as Datadog.
  • Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus.
  • Experience with monitoring/tracing tools such as Langfuse.
  • Excellent communication and stakeholder management skills.
  • Ability to stay calm under pressure and drive structured resolution.
  • Nice to have: Exposure to OpenAI or similar LLM platforms.
  • Nice to have: Experience supporting customer-facing SaaS products.
  • Nice to have: Automation mindset (runbooks, alert tuning, incident tooling).