Related skills
Datadog, Docker, Terraform, AWS, Prometheus
Description
- Drive SRE across services: SLIs/SLOs, error budgets, reliability reviews
- Design and maintain end-to-end observability: metrics, logs, traces, dashboards, alerts
- Partner with product/engineering to design reliable services; review architectures and rollouts
- Evolve AWS infra (networking/compute/data stores) with Terraform IaC
- Contribute reliability code, tooling, and health checks
- Define and iterate SLIs/SLOs and error budgets with service owners
Requirements
- 2+ years in SRE/DevOps on production systems
- Strong SRE practices: SLIs/SLOs, error budgets, toil reduction
- Proficiency in Python or Node.js/TypeScript
- Experience with Datadog/Prometheus/Grafana/Honeycomb/New Relic
- AWS in production; Terraform IaC; Docker/Kubernetes
- Incident management experience a plus; strong communication
Benefits
- Generous equity grant: own part of the company
- MacBook provided
- Comprehensive benefits package
- Flexible PTO and hybrid work schedules
- Work from home stipend
- Hybrid hubs in LA, SF, Toronto, Raleigh with in-office lunch