Staff Site Reliability Engineer

Related skills

Datadog, Terraform, AWS, Prometheus, Python

📋 Description

  • Own end-to-end reliability domains (observability, incidents, performance)
  • Drive SRE practices: SLIs/SLOs, error budgets, reliability reviews
  • Lead multi-sprint, multi-engineer reliability initiatives across teams
  • Design and maintain observability (metrics, logs, traces, dashboards)
  • Partner with product/engineering on reliability, architecture, capacity
  • Evolve AWS infra with IaC; contribute code/tools; leverage AI

🎯 Requirements

  • 8+ years operating complex SaaS systems in production
  • Proven leader of multi-sprint, multi-engineer reliability projects
  • Experience leading org-wide reliability or performance initiatives
  • Strong software engineering: production code in Python or TypeScript
  • Deep observability expertise: metrics/logs/traces; Datadog/Prometheus
  • AWS in production with IaC (Terraform) and container platforms (ECS/EKS/K8s)

🎁 Benefits

  • Generous equity grant; become an owner
  • MacBook provided
  • Comprehensive benefits package
  • Flexible PTO and hybrid work schedules
  • Work from home stipend
  • Hubs in LA, SF, Toronto, Raleigh with hybrid days