Related skills
terraform, helm, aws, prometheus, kubernetes
📋 Description
- Investigate outages and production failures across SaaS and self-hosted environments.
- Identify recurring failure patterns; implement fixes in Go or drive them with the owning teams.
- Lead post-incident reviews; document root causes and corrective actions.
- Collaborate with production engineering and SRE to develop playbooks and runbooks.
- Diagnose root causes across the full stack: queues, containers, cloud networking, memory.
🎯 Requirements
- 7+ years of software engineering; 3+ years working on infrastructure problems in distributed systems.
- Strong Go; Python and Helm a plus.
- RabbitMQ or Kafka/ActiveMQ; queue management, clustering, monitoring; observability stacks a plus.
- Kubernetes and Docker; pod lifecycle, networking, debugging.
- Incident experience; post-incident reviews; root-cause analysis; clear incident reports.
- Cloud and Linux fundamentals; AWS/Azure/GCP; logs/metrics under time pressure; cross-team comms.