Added
1 day ago
Type
Full time

Related skills

datadog cloud pagerduty networking gpu_compute_clusters

📋 Description

  • Lead responses to critical SEV-1/SEV-2 incidents impacting AI infra and data centers.
  • Serve as Incident Commander during major outages, coordinating cross-team efforts.
  • Act as liaison between leadership and external teams to provide updates.
  • Establish incident timelines, triage actions, and resolution plans.
  • Own the incident lifecycle, including triage, escalation, coordination, and post-incident reviews (PIRs).
  • Ensure timely, accurate communication with internal stakeholders and leadership.

🎯 Requirements

  • 8+ years in incident management, SRE, or infrastructure ops.
  • Experience leading incidents in large-scale distributed infra.
  • Strong knowledge of data center operations, GPU clusters, networking and storage, and cloud/hybrid environments.
  • Proven ability to lead high-pressure incident responses.
  • Experience with ITIL, SRE, or equivalent frameworks.
  • Experience with PagerDuty, ServiceNow, Jira, Datadog, Prometheus/Grafana.

🎁 Benefits

  • Generous cash and equity compensation.
  • Health, dental, and vision coverage for you and dependents.
  • Wellness and commuter stipends for select roles.
  • 401k plan with 2% company match (USA).
  • Flexible paid time off.