Related skills
hpc gpu nvidia infiniband rdmaπ Description
- Manage end-to-end GPU ops from facilities to customers.
- Own deployment, commissioning, and production sign-off.
- Own performance, uptime, and SLA targets (99.9%+).
- Lead incident response, escalation, and root-cause resolution.
- Oversee hardware lifecycles and GPU scaling to 20k+ GPUs.
- Build and lead field ops, deployment, infra teams across sites.
π― Requirements
- 10+ years in large-scale infra or hyperscale data centers.
- 5+ years operating GPU-accelerated or HPC environments.
- Direct experience deploying 10,000+ GPUs in production.
- Deep expertise in RDMA networking (InfiniBand and RoCE).
- Proven ownership of 99.9%+ uptime and customer SLAs.
- Leadership of 20β50+ across field ops, deployment, infra.
π Benefits
- Medical, dental, vision, life insurance
- Short- and long-term disability coverage
- Paid time off
- Entrepreneurial culture and growth mindset
- Radical transparency and open collaboration
- Work at the cutting edge of AI infrastructure
Meet JobCopilot: Your Personal AI Job Hunter
Automatically Apply to Engineering Jobs. Just set your
preferences and Job Copilot will do the rest β finding, filtering, and applying while you focus on what matters.
Help us maintain the quality of jobs posted on Empllo!
Is this position not a remote job?
Let us know!