Operations Engineer, HPC Networking

Related skills

Linux, Grafana, Prometheus, Python, Slurm

📋 Description

  • Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
  • Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
  • Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
  • Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
  • Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.

🎯 Requirements

  • At least 1 year of experience with InfiniBand or similar networking technologies.
  • Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.
  • Experience with Linux system administration and maintenance.
  • Proficiency in at least one scripting language.
  • Hands-on experience with NVIDIA UFM or similar fabric management tools.

🎁 Benefits

  • Medical, dental, and vision insurance - 100% paid by CoreWeave
  • 401(k) with a generous employer match
  • Flexible PTO
  • Tuition Reimbursement
  • Employee Stock Purchase Program (ESPP)
  • Mental wellness benefits