Related skills: cuda, tensorrt, gpu, rocm, transformer

Description
- Lead benchmarking and performance optimizations for inference engines and GPU kernels.
- Deep-dive into attention, memory, and precision optimization across multi-node GPU systems.
- Proactively implement cutting-edge optimization techniques for Gen AI workloads.
- Master GPU hardware and software stacks (CUDA, ROCm, TensorRT, Triton).
- Mentor through code and design reviews to raise the team's bar.
- Collaborate with Product/TPMs to translate hardware limits into features.
Requirements
- 5+ years in high-performance computing or AI infrastructure.
- Gen AI literacy across LLM, VLM, and LMM architectures.
- Expertise in optimizing attention layers and distributed GPU parallelism.
- Hardware fluency with NVIDIA/AMD GPUs; CUDA/ROCm ecosystems.
- Open source experience: build, integrate, contribute.
- Strong systems design for low-level GPU programming and memory access.
Benefits
- Reimbursement for conferences, training, and education.
- LinkedIn Learning access to 10,000+ courses.
- EAP and local employee meetups; flexible time off.
- Salary range based on market data; potential bonus.
- Equity grants and Employee Stock Purchase Program.
- DigitalOcean is an equal-opportunity employer.