Drive SLO adherence via metric monitoring and error budgets.
Ensure alerts are acknowledged, and escalate when SOPs have gaps.
Coordinate site deployments after SRE sign-off; monitor rollout health.
Collaborate with SRE on planning, risk, troubleshooting, and post-release optimizations.
Refine alert triage processes, SOP docs, and automation to reduce toil.
Mentor staff on SLO decisions; audit workflows; provide KPI reports.

🎯 Requirements

6-8 years of experience in operations/reliability; 2+ years supervising 24x7 teams.
Experience in 24x7 SaaS/cloud support operations.
Proficiency with Prometheus, Grafana, Splunk, PagerDuty.
Knowledge of ITIL frameworks and SRE principles; familiarity with AWS, Azure, GCP.
Grounding in process improvement, shift handoffs, and observability basics.
Bachelor's degree; ITIL Foundation certification is a plus.
