Added
1 hour ago
Type
Full time
Related skills
linux grafana prometheus python slurm
Description
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.
Requirements
- At least 1 year of experience with InfiniBand or similar networking technologies.
- Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.
- Experience with Linux system administration and maintenance.
- Proficiency in at least one scripting language.
- Hands-on experience with Nvidia UFM or similar fabric management tools.
Benefits
- Medical, dental, and vision insurance - 100% paid by CoreWeave
- 401(k) with a generous employer match
- Flexible PTO
- Tuition Reimbursement
- Employee Stock Purchase Program (ESPP)
- Mental wellness benefits