Software Engineer, Workload Enablement

Added
11 days ago
Type
Full time

Related skills

python kubernetes pytorch llm cuda

📋 Description

  • Port workloads to new platforms; ensure correctness and stability.
  • Build benchmarks and stress tests for CPU/GPU/memory/storage/network.
  • Deep-dive into the performance of distributed training and inference.
  • Tune NCCL/RCCL performance, including compute/communication overlap.
  • Create repeatable CI/lab test harnesses with actionable outputs.
  • Collaborate with systems/fleet engineers for stability and scalability.

🎯 Requirements

  • BS in CS/EE or equivalent practical experience
  • 5+ years in ML systems, performance engineering, distributed systems, or HPC
  • PyTorch and modern LLM training/inference stacks
  • Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
  • RDMA and debugging/optimizing comms libraries (NCCL or RCCL)
  • Python; C++/CUDA/HIP; profiling with Nsight/rocprof/perf