Investigate outages and production failures across SaaS and self-hosted environments.
Identify recurring failure patterns; drive fixes directly in Go or together with the owning teams.
Lead post-incident reviews; document root causes and corrective actions.
Collaborate with production engineering and SRE to develop playbooks and runbooks.
Diagnose root causes across the full stack: queues, containers, cloud networking, memory.

🎯 Requirements

7+ years of software engineering; 3+ years solving infrastructure problems in distributed systems.
Strong Go; Python and Helm a plus.
RabbitMQ or Kafka/ActiveMQ; queue management, clustering, monitoring; observability stacks a plus.
Kubernetes and Docker; pod lifecycle, networking, debugging.
Incident experience; post-incident reviews; root-cause analysis; clear incident reports.
Cloud and Linux fundamentals; AWS/Azure/GCP; reading logs and metrics under time pressure; cross-team communication.
