Senior Site Reliability Engineer

Related skills

Datadog, Terraform, AWS, Grafana, Prometheus

📋 Description

  • Drive SRE practices across services (SLIs/SLOs, error budgets, reliability reviews)
  • Design and maintain observability (metrics, logs, traces, dashboards, alerts)
  • Partner with product/engineering to design reliable services; review architectures, failures, rollout strategies
  • Evolve AWS infrastructure with Terraform IaC
  • Contribute reliability code to libraries, tooling, and health checks
  • Define and iterate SLIs/SLOs with owners to guide releases

🎯 Requirements

  • 5+ years in SRE/DevOps or production infra
  • Proven experience leading multi-sprint, multi-engineer projects with impact
  • Strong SRE practices: SLOs, toil reduction, safe deployments, post-incident reviews
  • Experience writing production code in Python or Node.js/TypeScript
  • Interest in AI-assisted tooling, with the ability to validate and improve its outputs
  • Observability skills with Datadog/Prometheus/Grafana/Honeycomb/New Relic

🎁 Benefits

  • Generous equity grant: become an owner
  • MacBook provided
  • Comprehensive benefits package
  • Flexible PTO and hybrid work schedules
  • Work-from-home stipend
  • Hubs in Los Angeles, San Francisco, Toronto, and Raleigh with hybrid days