Related skills
datadog, python, go, observability, incident management
Description
- Design resilient systems with capacity planning in mind.
- Define and measure SLOs/SLIs reflecting customer experience.
- Use Datadog and CloudWatch for signal-heavy observability.
- Configure alerting and routing via incident.io for on-call relevance.
- Improve the incident lifecycle: detection, triage, communication, follow-ups.
- Build highly available, easy-to-debug systems; minimize 2 a.m. alerts.
Requirements
- Bachelor's or master's degree in CS, or equivalent industry experience
- 3+ years in SRE or software engineering
- Python and/or Go coding
- Experience taking distributed systems from design to production
- Reliability mindset: SLOs/SLIs, error budgets, MTTR
- Observability and incident response; ability to diagnose issues from logs and metrics
- Cross-functional communication; documentation and runbooks
- Operational tooling and AI fluency; IaC workflows
- Leadership and mentorship to drive reliability initiatives
Benefits
- Healthcare
- Internet and cell phone reimbursement
- Learning and development stipend
- Potential travel to Mountain View HQ