Principal Site Reliability Engineer

Type
Full time

Related skills

Azure, Terraform, AWS, Grafana, Prometheus

📋 Description

  • Hybrid role: 3 days/week in San Jose, CA or remote; reports to Production Engineering.
  • Drive an automation-first culture, using code to cut toil and build self-healing systems.
  • Design highly available, scalable infrastructure across AWS, Azure, GCP, and bare metal.
  • Implement observability with Prometheus, Grafana, and OpenTelemetry; set SLIs/SLOs.
  • Lead Incident Commander duties; develop playbooks; run post-incident analyses.
  • Partner with Engineering for operability reviews.

🎯 Requirements

  • 10+ years ensuring the reliability, scalability, and availability of large-scale production services.
  • Deep programming experience in Python, Go, or C/C++.
  • Strong knowledge of networking, Linux/FreeBSD, and distributed architecture.
  • Experience in a 24/7 on-call rotation and incident management.
  • ITIL framework experience; drive maturity via operability reviews.

🎁 Benefits

  • Various health plans
  • Time off for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks, and more!