Hybrid role: 3 days/week in San Jose, CA or remote; reports to Production Engineering.
Drive an automation-first culture, using code to cut toil and build self-healing systems.
Design highly available, scalable infrastructure across AWS, Azure, GCP, and bare metal.
Implement observability with Prometheus, Grafana, and OpenTelemetry; define SLIs and SLOs.
Lead Incident Commander duties, develop playbooks, and conduct post-incident analyses.
Partner with Engineering for operability reviews.

🎯 Requirements

10+ years of experience ensuring the reliability, scalability, and availability of large-scale production services.
Deep programming experience in Python, Go, or C/C++.
Strong knowledge of networking, Linux/FreeBSD, and distributed architecture.
Experience with 24/7 on-call rotations and incident management.
Experience with the ITIL framework; ability to drive operational maturity via operability reviews.

🎁 Benefits

Various health plans
Time off for vacation and sick leave
Parental leave options
Retirement options
Education reimbursement
In-office perks, and more!
