Added
1 hour ago
Location
Type
Full time
Related skills
grafana, prometheus, distributed systems, observability, reliability
📋 Description
- Lead reliability and observability strategy across Robinhood’s infrastructure
- Coordinate across engineers to raise operational excellence and incident response
- Lead incident mitigation, coordinating owners and decisions during incidents
- Develop and maintain incident management processes to minimize customer impact
- Own dashboards and alerts tied to CUJs, availability, and business impact
- Evolve incident tooling and measure MTTD/MTTR improvements
🎯 Requirements
- 5+ years of software engineering with production systems
- 2+ years in reliability engineering, infrastructure, distributed systems
- Hands-on experience in incident leadership roles (IMOC, incident commander, on-call)
- Strong communication during high-severity incidents
- Deep knowledge of reliability, observability, fault-tolerant design
- Experience with multi-region architectures and failover strategies
🎁 Benefits
- Challenging, high-impact work to grow your career
- 100% paid health insurance for employees with 90% coverage for dependents
- Lifestyle wallet for wellness, learning, and more
- Employer-paid life & disability insurance, fertility benefits, mental health benefits
- Time off to recharge including holidays, PTO, sick time, parental leave, and more
- Exceptional office experience with catered meals, events, and comfortable workspaces