Related skills
Datadog, Terraform, AWS, Grafana, Prometheus
Description
- Own end-to-end reliability domains: strategy, roadmap, execution
- Drive SRE practices: SLIs/SLOs, error budgets, reviews
- Lead multi-sprint, multi-engineer reliability initiatives
- Design and maintain end-to-end observability: metrics, logs, traces, dashboards
- Serve as the subject-matter expert (SME) in at least one reliability area to guide decisions
- Partner with product and engineering to design reliable services
Requirements
- 8+ years operating complex SaaS systems and reliability initiatives
- Led multi-sprint, multi-engineer reliability initiatives with impact
- Led an org-wide reliability or performance initiative end-to-end
- Strong software engineering: production-quality code in Python or Node.js/TypeScript
- Regularly use LLMs and AI-assisted tooling to accelerate delivery
- Deep expertise in at least one reliability domain (observability, incident management, performance, data/search)
Benefits
- Generous equity grant; own a part of the company
- MacBook provided
- Comprehensive benefits package
- Flexible PTO and hybrid work schedules
- Work from home stipend
- Hubs in LA, SF, Toronto, Raleigh with hybrid days