Related skills
CUDA · TensorRT · ROCm · OpenAI Triton · BF16
📋 Description
- Lead benchmarking and performance optimizations for inference engines.
- Attention, memory, and precision optimizations; multi-node GPU parallelism.
- Apply cutting-edge Gen AI optimization techniques such as FP8 and BF16 reduced-precision inference.
- Collaborate with Product/TPMs to translate hardware limits into features.
- Mentor engineers via code/design reviews to raise the bar.
- Contribute to open-source AI and GPU performance communities.
🎯 Requirements
- Technical Depth: 5+ years in HPC or AI infrastructure.
- Gen AI literacy across LLM/VLM/LMM architectures.
- Attention-layer optimization and distributed GPU parallelism.
- Hardware fluency with NVIDIA/AMD GPUs and CUDA/ROCm ecosystems.
- Open source mastery: building with and contributing to OSS.
- Systems design skills for low-level GPU programming and memory access patterns.
🎁 Benefits
- We innovate with purpose, building for builders.
- Career development resources and LinkedIn Learning access.
- Well-being benefits and flexible time off for work-life balance.
- Competitive compensation and equity opportunities.