Added
15 days ago
Type
Full time
Related skills
datadog terraform aws prometheus python
Description
- Own end-to-end reliability domains (observability, incidents, performance)
- Drive SRE practices: SLIs/SLOs, error budgets, reliability reviews
- Lead multi-sprint, multi-engineer reliability initiatives across teams
- Design and maintain observability (metrics, logs, traces, dashboards)
- Partner with product/engineering on reliability, architecture, capacity
- Evolve AWS infra with IaC; contribute code/tools; leverage AI
Requirements
- 8+ years operating complex SaaS systems in production
- Proven leader of multi-sprint, multi-engineer reliability projects
- Experience leading org-wide reliability or performance initiatives
- Strong software engineering: production code in Python or TypeScript
- Deep observability expertise: metrics/logs/traces; Datadog/Prometheus
- AWS in production with IaC (Terraform) and container platforms (ECS/EKS/K8s)
Benefits
- Generous equity grant; become an owner
- MacBook provided
- Comprehensive benefits package
- Flexible PTO and hybrid work schedules
- Work from home stipend
- Hubs in LA, SF, Toronto, Raleigh with hybrid days