Related skills
Datadog, Terraform, GitHub Actions, AWS, Prometheus
📋 Description
- Design scalable, fault-tolerant, self-healing multi-region AWS systems.
- Define and track SLOs/SLIs; use them to drive decisions and error budgets.
- Run blameless post-incident reviews; implement preventive measures.
- Build automation to eliminate manual work; create internal tools.
- Move from basic monitoring to deep observability with actionable insights.
- Refine on-call to reduce alert fatigue; improve MTTR.
🎯 Requirements
- Bachelor’s degree in Computer Engineering or a related field.
- 5+ years SRE experience.
- 3+ years AWS with container orchestration.
- 2+ years Kubernetes experience.
- Observability with Prometheus, Datadog, OpenTelemetry.
- Terraform IaC; chaos engineering; incident management.
🎁 Benefits
- Remote-first work arrangement with flexible hours.
- Health insurance coverage.
- Work-from-anywhere stipend.
- Wellness and learning credits.
- Annual all-expenses-paid company retreat.
- On-call rotation with paid standby and rest periods.