Added
7 days ago
Type
Full time
Related skills
ansible terraform helm linux prometheus
Description
- Establish and lead the SRE/production engineering practice for Common Services.
- Define the reliability strategy, processes, and standards; drive operability across teams.
- Own incident management lifecycle, including on-call rotations and reviews.
- Develop an Operational Excellence plan focused on performance and toil reduction.
- Drive observability strategy (metrics, logs, traces, dashboards, alerts).
- Partner with engineering to design and review architectures for reliability and scalability.
Requirements
- 7+ years in SRE, production engineering, or similar roles on distributed systems.
- 2+ years of technical leadership (team lead, staff/principal, or people manager).
- Strong experience with Linux production environments, containers, and Kubernetes, including debugging live systems.
- Hands-on experience with observability stacks (metrics, logs, tracing) and alerting, including SLI/SLO design.
- Proven on-call and incident response experience, including post-incident reviews.
- Experience with IaC and automation tools (Terraform, Ansible, Helm) to automate ops.
Benefits
- Medical, dental, and vision insurance, 100% paid by CoreWeave
- Company-paid life insurance and voluntary life insurance
- Short- and long-term disability insurance
- 401(k) with generous employer match
- Flexible PTO and paid parental leave
- Tuition reimbursement and ESPP