Related skills
terraform, distributed systems, observability, cdk, slos
Description
- Participate in high-impact incident response with calm decision-making
- Define and evolve org-wide incident practices and reliability tooling
- Architect observability platforms for actionable insights on health and paths
- Lead reliability practices, including alerting hygiene and SLO design
- Guide teams in building resilient, fault-tolerant services
- Mentor engineers in operational rigor and reliability principles
Requirements
- 8+ years operating and scaling production infrastructure in cloud-native environments
- Deep expertise in incident response, debugging distributed systems, and reliability improvements
- Strong knowledge of observability stacks (metrics, logs, traces), alerting, and SLO design
- Experience implementing fault isolation, graceful degradation, and chaos engineering
- Proficiency with IaC and config management (Terraform, CDK, etc.)
- Proven ability to influence teams through standards, tooling, and culture
Benefits
- Flexible, hybrid work environment
- Unlimited vacation
- 100% paid employee health benefit options (including medical, dental, and vision)
- Commuter benefits
- 401(k) with employer-funded match
- Corporate wellness program with Wellhub