Related skills
terraform, linux, bash, aws, python
📋 Description
- Act as primary or escalation responder in a 24x7 on-call rotation
- Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
- Coordinate across Engineering, Infrastructure, Security, and Product teams
- Execute and improve runbooks, playbooks, and escalation paths
- Drive blameless post-incident reviews (PIRs) and track corrective actions
- Monitor reliability through dashboards and observability tooling; own service health across infrastructure, applications, and dependencies
🎯 Requirements
- Strong Linux systems administration skills; experience in incident management and production support
- Cloud infrastructure (AWS, Azure, GCP) and containers (Docker, Kubernetes)
- Monitoring/alerting and observability platforms (Grafana/Prometheus/Datadog/CloudWatch)
- Scripting or programming in Python, Bash, or Go; Infrastructure as Code (Terraform, Ansible)
- Networking fundamentals (DNS, TCP/IP, load balancing) and 24x7 NOC/production ops experience
- Incident response mindset; runbooks, PIRs, and collaboration with cross-functional teams