Related skills
datadog cloud pagerduty networking gpu_compute_clusters
Description
- Lead responses to critical SEV-1/SEV-2 incidents impacting AI infra and data centers.
- Serve as Incident Commander during major outages, coordinating cross-team efforts.
- Act as liaison between leadership and external teams to provide updates.
- Establish incident timelines, triage actions, and resolution plans.
- Own the incident lifecycle, including triage, escalation, coordination, and post-incident reviews (PIRs).
- Ensure timely, accurate communication with internal stakeholders and leadership.
Requirements
- 8+ years in incident management, SRE, or infrastructure ops.
- Experience leading incidents in large-scale distributed infra.
- Strong knowledge of data center ops, GPU clusters, networking and storage, and cloud/hybrid environments.
- Proven ability to lead high-pressure incident responses.
- Experience with ITIL, SRE, or equivalent frameworks.
- Experience with PagerDuty, ServiceNow, Jira, Datadog, Prometheus/Grafana.
Benefits
- Generous cash and equity compensation.
- Health, dental, and vision coverage for you and dependents.
- Wellness and commuter stipends for select roles.
- 401k plan with 2% company match (USA).
- Flexible paid time off.