Related skills
datadog docker terraform aws grafana
Description
- Drive SRE practices across services: SLIs/SLOs, error budgets, reliability reviews
- Design end-to-end observability with metrics, logs, traces, dashboards, alerts
- Collaborate with product/engineering to build reliable services and rollout strategies
- Evolve AWS infrastructure with Terraform IaC and automation
- Contribute code to reliability libraries, tooling, and health checks
- Participate in incident response and post-incident reviews
Requirements
- 2+ years in Site Reliability Engineering, DevOps, or infrastructure roles on production systems
- Strong SRE practices: SLIs/SLOs, error budgets, toil reduction
- Production coding experience in Python or Node.js/TypeScript
- Experience with AWS, Terraform-based IaC, and containers (Docker, ECS, EKS, Kubernetes)
- Observability tooling: Datadog, Prometheus, Grafana, Honeycomb, or New Relic
- Incident management experience, including post-incident follow-ups, is a strong plus
Benefits
- Generous equity grant; own part of the company
- MacBook provided
- Comprehensive benefits package
- Flexible PTO and hybrid work schedules
- Work-from-home stipend
- Hubs in LA, SF, Toronto, and Raleigh with hybrid days and lunch