Senior Software Engineer - Infrastructure Reliability

Added: 10 minutes ago
Type: Full time
Salary: Not provided

Related skills

Terraform, Helm, AWS, Prometheus, Kubernetes

📋 Description

  • Investigate outages and production failures across SaaS and self-hosted environments.
  • Identify recurring failure patterns; drive fixes directly in Go or together with the owning teams.
  • Lead post-incident reviews; document root causes and corrective actions.
  • Collaborate with production engineering and SRE to develop playbooks and runbooks.
  • Diagnose root causes across the full stack: queues, containers, cloud networking, memory.

🎯 Requirements

  • 7+ years of software engineering; 3+ years working on infrastructure problems in distributed systems.
  • Strong Go; Python and Helm a plus.
  • RabbitMQ or Kafka/ActiveMQ: queue management, clustering, and monitoring; observability stacks a plus.
  • Kubernetes and Docker; pod lifecycle, networking, debugging.
  • Incident experience; post-incident reviews; root-cause analysis; clear incident reports.
  • Cloud and Linux fundamentals; AWS/Azure/GCP; analyzing logs and metrics under time pressure; cross-team communication.