Overview
CoreWeave is seeking an Operations Engineering Manager, Fleet Reliability, to lead the reliability and performance of CoreWeave's GPU compute fleet across multiple data centers. You will own incident response, proactive reliability engineering, capacity planning, and the continuous improvement of deployment, monitoring, and on-call processes. The role requires strong leadership, close collaboration with SRE, Platform, and Hardware teams, and hands-on technical depth in cloud infrastructure, Linux, scripting, and automation. Location options include Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; and Bellevue, WA. This is a full-time, on-site role with opportunities to work across the CoreWeave fleet.
Responsibilities
- Lead a team of engineers responsible for fleet reliability and incident response for CoreWeave's GPU compute fleet across multiple data centers
- Develop and implement reliability strategies covering monitoring, alerting, capacity planning, and disaster recovery
- Collaborate with SRE, Platform, and Hardware teams to optimize deployment, upgrades, and maintenance of the fleet
- Champion automation and tooling to reduce MTTR and improve deployment velocity
- Manage on-call rotations, drive post-incident reviews, and implement process improvements
- Mentor and manage engineers, set goals, and participate in hiring efforts as needed
- Ensure security and compliance considerations are integrated into reliability decisions
Qualifications
- 5+ years of engineering leadership in a large-scale infrastructure environment
- Deep experience with Linux systems, virtualization, containers, and container orchestration (Docker, Kubernetes)
- Strong background in cloud infrastructure, monitoring, incident management, and reliability engineering
- Familiarity with GPU compute hardware, drivers, and HPC environments
- Proficiency in scripting and automation (Python, Bash)
- Excellent communication and collaboration skills
- Bachelor's degree in Computer Science, Engineering, or equivalent
Location
Livingston, NJ; New York, NY; Plano, TX; Sunnyvale, CA; Bellevue, WA