Related skills
CUDA, TensorRT, ROCm, Transformer, MoE

Description
- Lead benchmarking and perf optimizations for inference engines.
- Engineer solutions for memory bandwidth and compute bottlenecks.
- Implement cutting-edge optimization techniques to lead in the Gen AI landscape.
- Improve performance across batch sizes; tune AITER CK/ASK kernels for FP8/BF16.
- Identify kernel fusion opportunities in GLM-5 Transformer blocks.
- Tune router (gating) kernels for MoE models like Qwen3-235B.
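To make the router-kernel bullet concrete, here is a minimal sketch of what an MoE router computes: top-k expert selection plus renormalized gate weights. This is an illustrative NumPy reference, not this team's implementation; the function name `topk_route` and the shapes are assumptions, and production engines fuse this logic into a single GPU kernel.

```python
import numpy as np

def topk_route(logits, k=2):
    """Toy MoE router: per token, pick the k experts with the largest
    logits and softmax-normalize the gate weights over only those k."""
    # indices of the k largest logits per token (ascending within the k)
    idx = np.argsort(logits, axis=-1)[:, -k:]
    sel = np.take_along_axis(logits, idx, axis=-1)
    # numerically stable softmax over the selected experts only
    sel = sel - sel.max(axis=-1, keepdims=True)
    w = np.exp(sel)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

logits = np.array([[0.1, 2.0, -1.0, 0.5]])  # 1 token, 4 experts
idx, w = topk_route(logits, k=2)            # experts 3 and 1 selected
```

In a real engine this per-token argmax/softmax is latency-critical at large batch sizes, which is why the posting calls out tuning these kernels specifically.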
Requirements
- 5+ years in HPC or AI infra solving compute and memory bottlenecks.
- Gen AI literacy across the LLM/VLM/LMM landscape.
- Optimization expertise in attention layers and distributed GPU parallelism.
- Hardware fluency with NVIDIA/AMD GPUs and CUDA/ROCm.
- Open-source mastery, with a track record of contributing to OSS projects.
- Systems design skills: low-level GPU programming and memory access patterns.
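Since the requirements single out attention-layer optimization, a compact reference for scaled dot-product attention helps frame what gets optimized. This is an unfused NumPy baseline under assumed 2-D shapes (no batching, no masking); fused kernels such as FlashAttention compute the same result without materializing the full score matrix.

```python
import numpy as np

def sdpa(q, k, v):
    """Reference scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    Deliberately naive; optimized engines fuse these steps on the GPU."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq_q, seq_k) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)            # row-wise softmax
    return p @ v                                  # (seq_q, d_v) output

k = np.eye(3)
v = np.arange(9.0).reshape(3, 3)
out = sdpa(np.eye(3), k, v)
```

The memory-bandwidth cost of reading and writing that `(seq_q, seq_k)` matrix is exactly the kind of bottleneck the role targets.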
Benefits
- We innovate with purpose and ship impactful AI tech.
- Career development resources including conferences and courses.
- Well-being support: EAP, local meetups, flexible time off.
- Equal opportunity employer; inclusive, diverse culture.
- Global remote-friendly culture with ownership and accountability.