Own reliability, availability, and performance of production systems; define SLOs.
Design and build autonomous AI agents for monitoring distributed systems.
Lead on-call incident response; drive blameless postmortems.
Build automation and Infrastructure-as-Code with Terraform.
Improve observability with Datadog, Prometheus, Grafana, OpenTelemetry.
Collaborate across teams and mentor junior engineers.

🎯 Requirements

5+ years in SRE/DevOps/Platform Engineering.
Cloud experience with Azure (preferred), AWS, or GCP.
Strong Linux, networking, and distributed systems troubleshooting.
Container orchestration with Kubernetes (EKS/AKS); Terraform IaC.
Python, Go, Bash, or C#/.NET scripting.
Observability with Datadog, Prometheus, Grafana, or OpenTelemetry.

🎁 Benefits

Flexible vacation
12 company-paid holidays
10 paid sick days
Health, dental, and vision insurance
401(k) with matching
Equity
