Overview

CoreWeave is seeking a Principal Engineer - Observability to lead the design and implementation of the company’s observability platform across distributed GPU-accelerated infrastructure. This role focuses on delivering robust visibility, reliability, and performance insights for services and systems at scale.

Responsibilities

Design, implement, and maintain the observability architecture across production systems.
Instrument services with metrics, traces, and logs using industry standards (Prometheus, OpenTelemetry, Jaeger).
Define and manage SLIs/SLOs, and build dashboards and alerting to enable proactive reliability.
Collaborate with Platform, DevOps, and Engineering teams to drive improvements in performance and reliability.
Mentor and guide junior engineers in best practices for observability and incident response.
Champion scalable instrumentation and data quality across multi-region deployments.

Qualifications

8+ years of software engineering or SRE experience, with a focus on observability and reliability.
Strong hands-on experience with Prometheus, Grafana, OpenTelemetry, and distributed tracing (Jaeger or similar).
Deep knowledge of Kubernetes, Docker, and cloud environments (AWS, GCP, Azure).
Proficiency in one or more programming languages such as Go or Python.
Excellent problem-solving, communication, and collaboration skills.

About CoreWeave

CoreWeave is a leading provider of GPU-accelerated compute for AI, machine learning, and HPC workloads. We empower teams to run large-scale workloads with speed and reliability.

What we offer

Competitive compensation and comprehensive benefits
Collaborative, fast-paced engineering culture
Opportunities for professional growth and impact
