Overview

CoreWeave is seeking an Operations Engineering Manager, Fleet Reliability, to lead the reliability and performance of its GPU compute fleet across multiple data centers. You will own incident response, proactive reliability engineering, capacity planning, and the continuous improvement of deployment, monitoring, and on-call processes. The role requires strong leadership, close collaboration with the SRE, Platform, and Hardware teams, and hands-on technical depth in cloud infrastructure, Linux, scripting, and automation. This is a full-time, on-site role based in one of the following locations: Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; or Bellevue, WA. It includes opportunities to work across the CoreWeave fleet.

Responsibilities

Lead a team of engineers responsible for fleet reliability and incident response for CoreWeave's GPU compute fleet across multiple data centers
Develop and implement reliability strategies, monitoring, alerting, capacity planning, and disaster recovery
Collaborate with SRE, Platform, and Hardware teams to optimize deployment, upgrades, and maintenance of the fleet
Champion automation and tooling to reduce MTTR and improve deployment velocity
Manage on-call rotations, drive post-incident reviews, and implement process improvements
Mentor and manage engineers, set goals, and participate in hiring efforts as needed
Ensure security and compliance considerations are integrated into reliability decisions

Qualifications

5+ years of engineering leadership in a large-scale infrastructure environment
Deep experience with Linux systems, virtualization, and container orchestration (Kubernetes, Docker)
Strong background in cloud infrastructure, monitoring, incident management, and reliability engineering
Familiarity with GPU compute hardware, drivers, and HPC environments
Proficient in scripting and automation (Python, Bash)
Excellent communication and collaboration skills
Bachelor's degree in Computer Science, Engineering, or equivalent

Location

Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; Bellevue, WA
