Related skills
ansible grafana prometheus python infiniband๐ Description
- Troubleshoot GPU and PCIe failures.
- Partner with vendors on failure analysis.
- Track component RMAs.
- Automate server hardware lifecycle.
- Lead hardware escalation and troubleshooting.
- Define hardware requirements, specs and playbooks.
๐ฏ Requirements
- 2+ years of data center GPU troubleshooting (H100+; Infiniband; NVLink).
- Proficiency in Ansible and Python; IPMI or Redfish (Redfish preferred).
- Experience automating GPU diagnostics with Prometheus & Grafana.
- Deep knowledge of GPUs, PCIe and server hardware.
- Collaborated with hardware vendors to create alerts and playbooks.
- Strong automation mindset with thorough documentation.
๐ Benefits
- Medical, dental, and vision insurance paid.
- Company-paid life and disability insurance.
- 401(k) with employer match; ESPP options.
- Tuition reimbursement; flexible PTO.
- Wellness programs and HSA/FSA.
- Hybrid work environment with remote options.
Meet JobCopilot: Your Personal AI Job Hunter
Automatically Apply to Engineering Jobs. Just set your
preferences and Job Copilot will do the rest โ finding, filtering, and applying while you focus on what matters.
Help us maintain the quality of jobs posted on Empllo!
Is this position not a remote job?
Let us know!