Reliability Lead, Common Services

Added: 7 days ago
Type: Full time

Related skills

ansible terraform helm linux prometheus

📋 Description

  • Establish and lead the SRE/production engineering practice for Common Services.
  • Define the reliability strategy, processes, and standards; drive operability across teams.
  • Own the incident management lifecycle, including on-call rotations and post-incident reviews.
  • Develop an Operational Excellence plan focused on performance and toil reduction.
  • Drive observability strategy (metrics, logs, traces, dashboards, alerts).
  • Partner with engineering to design and review architectures for reliability and scalability.

🎯 Requirements

  • 7+ years in SRE, production engineering, or similar roles on distributed systems.
  • 2+ years of technical leadership (team lead, staff/principal, or people manager).
  • Strong experience with Linux production environments, containers, and Kubernetes, including debugging live systems.
  • Hands-on experience with observability stacks (metrics, logs, tracing) and alerting, including SLI/SLO design.
  • Proven on-call and incident response experience, including post-incident reviews.
  • Experience with IaC and automation tools (Terraform, Ansible, Helm) to automate ops.

🎁 Benefits

  • Medical, dental, and vision insurance, 100% paid by CoreWeave
  • Company-paid life insurance and voluntary life insurance
  • Short- and long-term disability insurance
  • 401(k) with generous employer match
  • Flexible PTO and paid parental leave
  • Tuition reimbursement and ESPP