Added: less than a minute ago
Type: Full time
Salary: not provided

Related skills

datadog terraform aws prometheus python

📋 Description

  • Lead and mentor a high-performing SRE team, fostering a culture of ownership and collaboration.
  • Own availability, scalability, and performance of core systems.
  • Define reliability vision; drive incident response and postmortems.
  • Partner with engineering, product, and security to design resilient systems.
  • Drive SLOs, SLIs, and SLAs, along with capacity planning and cost optimization.
  • Promote automation and observability to reduce toil.

🎯 Requirements

  • Bachelor's or Master's in Computer Science, Engineering, or a related field.
  • 8+ years in SRE/DevOps/Production Engineering, with 2+ years in people management.
  • Strong grounding in distributed systems, AWS, Kubernetes, and networking fundamentals.
  • Experience with observability platforms (Datadog, Prometheus, OpenTelemetry) and incident management.
  • Scripting (Python, Go) and IaC (Terraform, Pulumi).
  • Experience defining and driving SLO/SLI-based reliability strategies.
  • Excellent problem-solving and leadership through incidents.

🎁 Benefits

  • Health coverage for full-time employees.
  • Paid parental leave, generous PTO and holidays.
  • Stock options and home/office equipment provided.
  • Quarterly wellness days and learning programs.
  • Inclusive culture with opportunities for growth and development.