Related skills
CUDA, TensorRT, ROCm, Transformer, OpenAI Triton
Description
- Lead benchmarking and performance optimization of the inference engine and GPU kernels.
- Engineer solutions for memory bandwidth and compute bottlenecks across multi-node GPU deployments.
- Implement cutting-edge Gen AI optimization techniques to keep the engine at the state of the art.
- Partner with Product Management and TPMs to translate hardware limits into features.
- Maintain a strong presence in GPU performance and open-source communities.
๐ฏ Requirements
- 5+ years in HPC/AI infra solving compute and memory bottlenecks.
- Gen AI literacy across LLM/VLM/LMM and major model families.
- Hands-on attention-layer optimization and distributed GPU parallelization.
- Hardware fluency: NVIDIA/AMD GPUs and CUDA/ROCm ecosystems.
- Open source mastery: build/contribute to OSS projects.
- Systems design: low-level GPU programming, memory patterns, parallel execution.
Benefits
- Career development with conferences, training, and education support.
- Global benefits: EAP, local meetups, flexible time off.
- LinkedIn Learning access for growth and skills.
- Equity and Employee Stock Purchase Program.