Overview
CoreWeave is seeking a Principal Engineer - Observability to lead the design and implementation of the company’s observability platform across distributed GPU-accelerated infrastructure. This role focuses on delivering robust visibility, reliability, and performance insights for services and systems at scale.
Responsibilities
- Design, implement, and maintain the observability architecture across production systems.
- Instrument services with metrics, traces, and logs using industry-standard tooling (Prometheus, OpenTelemetry, Jaeger).
- Define and manage SLIs/SLOs, and build dashboards and alerting to enable proactive reliability.
- Collaborate with Platform, DevOps, and Engineering teams to drive improvements in performance and reliability.
- Mentor and guide junior engineers in best practices for observability and incident response.
- Champion scalable instrumentation and data quality across multi-region deployments.
Qualifications
- 8+ years of software engineering or SRE experience, with a focus on observability and reliability.
- Strong hands-on experience with Prometheus, Grafana, OpenTelemetry, and distributed tracing (Jaeger or similar).
- Deep knowledge of Kubernetes, Docker, and cloud environments (AWS, GCP, Azure).
- Proficiency in one or more programming languages such as Go or Python.
- Excellent problem-solving, communication, and collaboration skills.
About CoreWeave
CoreWeave is a leading provider of GPU-accelerated compute for AI, machine learning, and HPC workloads. We empower teams to run large-scale workloads with speed and reliability.
What we offer
- Competitive compensation and comprehensive benefits
- Collaborative, fast-paced engineering culture
- Opportunities for professional growth and impact