Type: Full time
Related skills
grafana prometheus observability reliability opentelemetry
Description
- Drive incident leadership and reliability strategy across Robinhood infrastructure
- Partner with engineers to raise operational excellence and incident response
- Coordinate incident mitigation; manage rollbacks and traffic shifts
- Develop incident management processes for timely resolution
- Own global dashboards and alerts tied to critical user journeys (CUJs) and availability
- Evolve incident tooling; improve MTTD/MTTR metrics
Requirements
- 5+ years of software engineering experience, including production systems
- 2+ years focused on reliability engineering, infra, distributed systems, or production operations
- Hands-on incident leadership roles (IMOC, incident commander, on-call)
- Deep knowledge of reliability, observability frameworks, and fault-tolerant architecture
- Experience with multi-region or multi-cluster architectures, capacity planning, and failover
- Familiarity with modern observability stacks (OpenTelemetry, Prometheus, Grafana)
Benefits
- Challenging, high-impact work to grow your career
- Equity, bonuses, and 401(k) matching
- 100% paid health insurance for employees; 90% for dependents
- Lifestyle wallet; flexible benefits spending
- Life and disability insurance, fertility benefits, and mental health coverage
- Generous PTO, holidays, sick leave, parental leave