Related skills
python · kubernetes · pytorch · llm · cuda

📋 Description
- Port workloads to new platforms; ensure correctness and stability.
- Build benchmarks and stress tests for CPU/GPU/memory/storage/network.
- Deep-dive into the performance of distributed training/inference.
- NCCL/RCCL performance tuning; compute/communication overlap.
- Create repeatable CI/lab test harnesses with actionable outputs.
- Collaborate with systems/fleet engineers for stability and scalability.
🎯 Requirements
- BS in CS/EE or equivalent practical experience
- 5+ years in ML systems, performance engineering, distributed systems, or HPC
- Experience with PyTorch and modern LLM training/inference stacks
- Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
- Experience with RDMA and debugging/optimizing communication libraries (NCCL or RCCL)
- Python; C++/CUDA/HIP; profiling with Nsight/rocprof/perf