Added
4 days ago
Type
Full time

Related skills

sre helm grafana prometheus kubernetes

📋 Description

  • Be the technical front door for post-sales accounts; triage Tier 1/2 issues; lead P0 war rooms.
  • Diagnose runtime issues: latency, memory, GPU usage, concurrency, and model lifecycle.
  • Debug infra across Kubernetes, networking, observability, and alerting.
  • Pull logs and traces; correlate signals in Grafana/Loki/Prometheus to root causes.
  • Lead outage response and coordinate across Product/SRE/Infra; deliver root cause analyses after P0/P1 incidents.
  • Proactively own top enterprise accounts: set up monitoring and alerts; drive QBRs and expansion.
  • Coordinate SA-to-CE handoffs; maintain runbooks and diagnostic best practices; drive end-to-end projects.

🎯 Requirements

  • Deep Kubernetes troubleshooting experience, including pod and resource debugging; experience with Grafana/Loki/Prometheus.
  • Infra debugging across container orchestration, networking, service dependencies; production cluster experience.
  • Experience managing high-severity incidents: SLAs, war rooms, post-incident reviews, exec comms.
  • Strong project management with clear ownership; able to run multiple complex, multi-stakeholder initiatives.
  • Ability to translate recurring technical pain points into roadmap-level insights and product improvements.
  • Strong communication and executive presence during high-visibility situations.

🎁 Benefits

  • Meaningful equity
  • 100% coverage of medical, dental, and vision for you and dependents
  • Generous PTO including Winter Break
  • Paid parental leave
  • Company-facilitated 401(k)
  • Exposure to ML startups and learning opportunities