This job is no longer available

The job listing you are looking for has expired.
Please browse our latest remote jobs.

Operations Engineering Manager, Fleet Reliability

Added: 26 days ago
Type: Full time
Salary: Not specified

Overview

CoreWeave is seeking an Operations Engineering Manager, Fleet Reliability to lead the reliability and performance of CoreWeave's GPU compute fleet across multiple data centers. You will own incident response, proactive reliability engineering, capacity planning, and the continuous improvement of deployment, monitoring, and on-call processes. The role requires strong leadership, collaboration with SRE, Platform, and Hardware teams, and hands-on technical depth in cloud infrastructure, Linux, scripting, and automation. Location options include Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; and Bellevue, WA. This is a full-time, on-site role with opportunities to work across the CoreWeave fleet.

Responsibilities

  • Lead a team of engineers responsible for fleet reliability and incident response for CoreWeave's GPU compute fleet across multiple data centers
  • Develop and implement reliability strategies, monitoring, alerting, capacity planning, and disaster recovery
  • Collaborate with SRE, Platform, and Hardware teams to optimize deployment, upgrades, and maintenance of the fleet
  • Champion automation and tooling to reduce MTTR and improve deployment velocity
  • Manage on-call rotations, drive post-incident reviews, and implement process improvements
  • Mentor and manage engineers, set goals, and participate in hiring efforts as needed
  • Ensure security and compliance considerations are integrated into reliability decisions

Qualifications

  • 5+ years of engineering leadership in a large-scale infrastructure environment
  • Deep experience with Linux systems, virtualization, and container orchestration (Kubernetes, Docker)
  • Strong background in cloud infrastructure, monitoring, incident management, and reliability engineering
  • Familiarity with GPU compute hardware, drivers, and HPC environments
  • Proficient in scripting and automation (Python, Bash)
  • Excellent communication and collaboration skills
  • Bachelor's degree in Computer Science, Engineering, or equivalent

Location

Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; Bellevue, WA
