Related skills
SRE, Terraform, AWS, Python, Kubernetes
Description
- Own one or more reliability domains end-to-end (observability, incidents, performance).
- Drive and refine modern SRE practices across services (SLIs/SLOs, error budgets, reviews).
- Lead multi-sprint, multi-engineer reliability initiatives with cross-team coordination.
- Design and maintain end-to-end observability (metrics, logs, traces, dashboards, alerts).
- Partner with product/engineering to design reliable services and influence architecture.
- Evolve and operate AWS infrastructure with IaC workflows and contribute code.
Requirements
- 8+ years operating complex, production SaaS systems and reliability initiatives.
- Proven experience leading multi-sprint, multi-engineer projects with impact.
- Experience leading org-wide reliability or performance initiatives end-to-end.
- Strong software engineering in Python or Node.js/TypeScript.
- Deep expertise in observability and monitoring (Datadog/Prometheus/Grafana).
- AWS in production with Terraform and container platforms (ECS/EKS/Kubernetes).
- Incident management experience: coordinating incident response and follow-ups.
Benefits
- Generous equity grant: become an owner in the company.
- MacBook computer provided.
- Comprehensive benefits package.
- Flexible PTO and hybrid work schedules.
- Work-from-home stipend.
- Hubs in LA, SF, Toronto, and Raleigh with hybrid days and lunch.