Added
11 hours ago
Type
Full time
Salary
Upgrade to Premium to se...

Related skills

linux kubernetes pytorch slurm infiniband

πŸ“‹ Description

  • Remotely deploy and configure large-scale HPC clusters for AI workloads
  • Remotely install and configure OS, firmware, software, and networking on HPC clusters
  • Troubleshoot HPC cluster issues with on-site deployment teams
  • Provide clear and detailed requirements to other teams for gaps and improvements
  • Contribute to creation and maintenance of Standard Operating Procedures
  • Mentor and assist less experienced team members

🎯 Requirements

  • 5+ years deploying and configuring HPC clusters for AI workloads
  • Deep understanding of HPC/AI architecture, OS, firmware, software, and networking
  • Expertise with SFP+ fiber, Infiniband, and 100 GbE networks
  • Experience with Ethernet, switching, GPU direct, RDMA, NCCL, Horovod
  • Linux-based compute nodes, firmware updates, driver installation
  • SLURM, Kubernetes, or other job scheduling systems

🎁 Benefits

  • Generous cash and equity compensation
  • Health, dental, and vision coverage
  • Wellness and commuter stipends
  • 401k Plan with 2% company match (USA)
  • Flexible paid time off plan
Share job

Meet JobCopilot: Your Personal AI Job Hunter

Automatically Apply to Engineering Jobs. Just set your preferences and Job Copilot will do the rest β€” finding, filtering, and applying while you focus on what matters.

Related Engineering Jobs

See more Engineering jobs β†’