Added
24 hours ago
Type
Full time
Salary
Upgrade to Premium to se...
Related skills
ansible grafana prometheus python gpuπ Description
- Troubleshoot complex GPU and PCIe related failures
- Partner with external vendors on failure analysis
- Track component RMAs
- Develop and maintain hardware/firmware management services
- Automate server hardware lifecycle processes
- Serve as senior hardware escalation contact
π― Requirements
- 2+ years experience supporting/troubleshooting data center GPUs (H100+; Infiniband/NVLink)
- Proficiency in Ansible and Python; IPMI/Redfish (Redfish preferred)
- Experience with GPU diagnostics and observability tools (Prometheus, Grafana)
- In-depth knowledge of server hardware, GPUs and PCIe devices
- Experience collaborating with hardware vendors to create playbooks and drive resolution
- Excellent documentation and attention to detail
π Benefits
- Medical, dental, and vision insurance - 100% paid by CoreWeave
- Company-paid Life Insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
Meet JobCopilot: Your Personal AI Job Hunter
Automatically Apply to Engineering Jobs. Just set your
preferences and Job Copilot will do the rest β finding, filtering, and applying while you focus on what matters.
Help us maintain the quality of jobs posted on Empllo!
Is this position not a remote job?
Let us know!