Related skills: grafana, prometheus, observability, opentelemetry
Description
- Serve as a senior technical leader for reliability and observability strategy.
- Partner with engineers to raise operational excellence and incident response.
- Lead incident mitigation by coordinating owners, rollbacks, and traffic shifts.
- Develop and maintain incident management processes to minimize downtime.
- Own dashboards and alerts tied to critical user journeys (CUJs), availability, and metrics.
- Drive post-incident governance and postmortem standards.
Requirements
- 8+ years of software engineering, including production systems.
- 4+ years in reliability engineering, infrastructure, or production operations.
- Hands-on experience in incident leadership roles (IMOC, on-call).
- Strong communication during high-severity incidents.
- Deep knowledge of reliability, observability, and fault-tolerant design.
- Familiarity with observability stacks (OpenTelemetry, Prometheus, Grafana).
Benefits
- Challenging, high-impact work to grow your career.
- Performance-based pay with equity, bonuses, and 401(k) matching.
- 100% paid health insurance for employees; 90% for dependents.
- Lifestyle wallet for wellness, learning, and more.
- Life and disability insurance, plus fertility and mental health benefits.
- Generous time off: holidays, PTO, sick time, parental leave, and more.