Senior Site Reliability Engineer

Related skills

datadog node.js terraform aws prometheus

📋 Description

  • Drive and refine SRE practices across services: SLIs/SLOs, error budgets, and reliability reviews.
  • Design end-to-end observability: metrics, logs, traces, dashboards, alerts.
  • Partner with product/engineering to design reliable services and rollout strategies.
  • Evolve and operate AWS infrastructure using Terraform IaC.
  • Contribute code and tooling for reliability libraries and health checks.
  • Define SLIs/SLOs with service owners to guide reliability and release decisions.

🎯 Requirements

  • 5+ years in SRE/DevOps/Infra with production systems.
  • Track record of leading multi-sprint, multi-engineer reliability and infrastructure initiatives with demonstrated impact.
  • Expertise in SRE: SLIs/SLOs, error budgets, post-incident reviews.
  • Experience writing production-grade software in Python or Node.js/TypeScript.
  • Hands-on observability experience with Datadog, Prometheus, Grafana, Honeycomb, or New Relic.
  • AWS production experience; Terraform IaC; Docker/ECS/EKS/Kubernetes.

🎁 Benefits

  • Generous equity grant.
  • MacBook provided.
  • Comprehensive benefits package.
  • Flexible PTO and hybrid work schedules.
  • Work from home stipend.
  • Hubs in Los Angeles, San Francisco, Toronto, and Raleigh, with hybrid in-office days.