Related skills
datadog, terraform, aws, prometheus, python
Description
- Lead and mentor a high-performing SRE team, fostering ownership and collaboration.
- Own availability, scalability, and performance of core systems.
- Define reliability vision; drive incident response and postmortems.
- Partner with engineering, product, and security to design resilient systems.
- Drive SLOs, SLIs, and SLAs, along with capacity planning and cost optimization.
- Promote automation and observability to reduce toil.
Requirements
- Bachelor's or Master's in Computer Science, Engineering, or a related field.
- 8+ years in SRE, DevOps, or Production Engineering, including 2+ years in people management.
- Strong grasp of distributed systems, AWS, Kubernetes, and networking fundamentals.
- Experience with observability platforms (Datadog, Prometheus, OpenTelemetry) and incident management.
- Scripting (Python, Go) and IaC (Terraform, Pulumi).
- Experience defining and driving SLO/SLI-based reliability strategies.
- Excellent problem-solving and leadership through incidents.
Benefits
- Health coverage for full-time employees.
- Paid parental leave, generous PTO and holidays.
- Stock options, plus home and office equipment provided.
- Quarterly wellness days and learning programs.
- Inclusive culture with opportunities for growth and development.