Related skills
datadog aws llm asr tts
Description
- Own incident lifecycle from detection to postmortems.
- Act as central command during major incidents and updates.
- Define and enforce SLAs/SLOs, severity frameworks, and runbooks.
- Collaborate with Eng, ML, and Integrations teams to resolve issues.
- Monitor system health across integrations (LLMs, ASR/TTS).
- Drive RCA and preventive actions.
- Improve observability, alerting, and tooling.
- Maintain clear internal and customer-facing communication during incidents.
Requirements
- 3–6 years in Incident Management / SRE / Production Support.
- Strong understanding of distributed systems, APIs, and AWS.
- Experience with observability tools like DataDog.
- Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus.
- Experience with monitoring/tracing tools like Langfuse or similar.
- Excellent communication and stakeholder management skills.
- Ability to stay calm under pressure and drive structured resolution.
- Nice to have: Exposure to OpenAI or similar LLM platforms.
- Nice to have: Experience supporting customer-facing SaaS products.
- Nice to have: Automation mindset (runbooks, alert tuning, incident tooling).