Related skills
Azure, PagerDuty, AWS, Grafana, Prometheus
Description
- Drive SLO adherence via metric monitoring and error budgets.
- Ensure alerts are acknowledged, and escalate where SOPs have gaps.
- Coordinate site deployments after SRE sign-off; monitor rollout health.
- Collaborate with SRE on planning, risk, troubleshooting, and post-release optimizations.
- Refine alert triage processes, SOP docs, and automation to reduce toil.
- Mentor staff on SLO decisions; audit workflows; provide KPI reports.
Requirements
- 6-8 years in operations/reliability; 2+ years supervising 24x7 teams.
- Experience in 24x7 SaaS/cloud support operations.
- Proficiency with Prometheus, Grafana, Splunk, PagerDuty.
- Knowledge of ITIL frameworks and SRE principles; experience with AWS, Azure, GCP.
- Grounding in process improvement, shift handoffs, and observability basics.
- Bachelor's degree; ITIL Foundation certification is a plus.