Related skills
grafana, prometheus, observability, opentelemetry
Description
- Lead reliability and observability strategy for Robinhood infrastructure.
- Partner with engineers to raise operational excellence.
- Lead incident mitigation, coordinating owners and rollbacks during incidents.
- Develop and maintain incident management processes for timely resolution.
- Own incident discovery with dashboards and alerts tied to user journeys.
- Drive post-incident governance and durable reliability improvements.
Requirements
- 8+ years of software engineering experience, including production systems.
- 4+ years in reliability engineering, infrastructure, distributed systems, or production operations.
- Hands-on experience in incident leadership roles (IMOC, on-call).
- Strong communication during high-severity incidents.
- Deep knowledge of reliability, observability, fault-tolerant design.
- Familiarity with OpenTelemetry, Prometheus, Grafana.
Benefits
- 100% paid health insurance for employees.
- 90% coverage for dependents.
- Lifestyle wallet for wellness and learning.
- Employer-paid life and disability insurance.
- Fertility benefits and mental health support.
- Paid time off, holidays, sick time, parental leave.