Related skills
linux kubernetes pytorch slurm infinibandπ Description
- Remotely deploy and configure large-scale HPC clusters for AI workloads
- Remotely install and configure OS, firmware, software, and networking on HPC clusters
- Troubleshoot HPC cluster issues with on-site deployment teams
- Provide clear and detailed requirements to other teams for gaps and improvements
- Contribute to creation and maintenance of Standard Operating Procedures
- Mentor and assist less experienced team members
π― Requirements
- 5+ years deploying and configuring HPC clusters for AI workloads
- Deep understanding of HPC/AI architecture, OS, firmware, software, and networking
- Expertise with SFP+ fiber, Infiniband, and 100 GbE networks
- Experience with Ethernet, switching, GPU direct, RDMA, NCCL, Horovod
- Linux-based compute nodes, firmware updates, driver installation
- SLURM, Kubernetes, or other job scheduling systems
π Benefits
- Generous cash and equity compensation
- Health, dental, and vision coverage
- Wellness and commuter stipends
- 401k Plan with 2% company match (USA)
- Flexible paid time off plan
Meet JobCopilot: Your Personal AI Job Hunter
Automatically Apply to Engineering Jobs. Just set your
preferences and Job Copilot will do the rest β finding, filtering, and applying while you focus on what matters.
Help us maintain the quality of jobs posted on Empllo!
Is this position not a remote job?
Let us know!