Staff Site Reliability Engineer

Related skills

datadog node.js terraform aws prometheus

📋 Description

  • Own end-to-end reliability domains: strategy, roadmap, execution.
  • Drive SRE practices: SLIs/SLOs, error budgets, reviews.
  • Lead multi-engineer reliability initiatives across teams.
  • Design and maintain observability: metrics, logs, traces, alerts.
  • Partner with product/engineering on reliable services and capacity.
  • Contribute tooling and code; use AI/LLMs to accelerate delivery.

🎯 Requirements

  • 8+ years operating complex SaaS and production systems.
  • Led multi-engineer, multi-sprint reliability/perf initiatives.
  • Led at least one org-wide reliability or performance initiative.
  • Deep expertise in observability, incident management, or data/search.
  • Strong software engineering: Python or TS/Node.js; AI tooling.
  • AWS production experience with Terraform IaC and container orchestration (ECS or Kubernetes on EKS).

🎁 Benefits

  • Generous equity grant; own part of the company.
  • MacBook provided.
  • Comprehensive benefits package.
  • Flexible PTO and hybrid work schedules.
  • Work-from-home stipend.
  • Hubs in LA, SF, Toronto, Raleigh with hybrid schedules and lunch.