Related skills
infiniband, rdma, nccl, tensorrt-llm, roce v2
📋 Description
- Make RDMA First-Class: integrate RDMA/RoCE/InfiniBand into the inference stack.
- Optimize Distributed Inference: tune networking for KV Cache Offload and WideEP.
- Enable serverless-grade startup for trillion-parameter models.
- Deep-dive into hardware: validate networking on H100/H200/NVL72 clusters.
- Build Observability: visualize packet flow, congestion, and bandwidth.
- Optimize Kernels: work with NCCL/NVSHMEM and custom kernels.
🎯 Requirements
- Deep experience with high-performance networking protocols (InfiniBand, RoCE v2).
- Fluent in C++ or Python; understand the memory hierarchy of H100/Blackwell GPUs.
- Bridge software and hardware; debug NVLink topology.
- Know when to use off-the-shelf vs custom solutions for performance.
- Knowledge of NCCL, NVSHMEM, UCX for GPU interconnects.
- Familiar with TensorRT-LLM, vLLM, or SGLang.
- Experience running low-level benchmarks to qualify new hardware clusters.
🎁 Benefits
- Competitive compensation with equity.
- Medical, dental, and vision coverage for you and dependents.
- Generous PTO including Winter Break.
- Paid parental leave.
- Company 401(k) program.
- Exposure to ML startups for learning and networking.