Type: Full time
Related skills: python, cuda, tensorrt, tensorrt-llm, ptq

Description
- Optimize large-scale models (LLMs/VLMs) with quantization (PTQ/QAT) and LoRA/QLoRA.
- Architect model conversion/compilation pipelines using TensorRT for edge deployment.
- Benchmark parity, accuracy recovery, and latency against PyTorch baselines and compiled edge binaries.
- Write and optimize CUDA kernels and TensorRT Plugins for high memory bandwidth and low latency.
- Produce production-grade, concurrent C++ and Python code for real-time edge inference.
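For illustration, the post-training quantization (PTQ) work described above can be sketched as follows. This is a minimal symmetric per-tensor INT8 example in NumPy, not Zoox's actual pipeline; the function names are hypothetical:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 PTQ sketch: w ~= q * scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to float for accuracy comparison."""
    return q.astype(np.float32) * scale

# Round-trip a random weight tensor and measure quantization error.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # rounding error is bounded by scale / 2
```

Production pipelines add calibration data, per-channel scales, and accuracy-recovery passes (e.g. QAT) on top of this basic round-trip.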
Requirements
- Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference (INT8, FP8, INT4, BF16/FP16).
- Proven experience optimizing large-scale models (LLMs, VLMs, or VLAs) using KV-cache, Speculative Decoding, and Efficient Attention (FlashAttention, Linear Attention).
- Extensive experience with model conversion/compilation pipelines (TensorRT, TensorRT-LLM) and benchmarking.
- Proficiency in low-level programming for AI accelerators: writing and optimizing custom CUDA kernels and TensorRT Plugins.
- Production-level C++ (14/17/20) and Python programming for concurrent, memory-safe, real-time edge inference.
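The KV-cache technique named above amounts to appending each decode step's key/value vectors to a running cache instead of recomputing attention inputs for the whole prefix. A minimal single-head NumPy sketch (illustrative only, not tied to any particular framework):

```python
import numpy as np

def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention over cached keys/values."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    p = np.exp(scores - scores.max())  # numerically stable softmax
    p /= p.sum()
    return p @ V

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outs = []
for step in range(4):
    k, v, q = rng.normal(size=(3, d))
    # Append only this step's key/value; earlier entries are reused,
    # so per-step cost grows linearly rather than quadratically.
    K_cache = np.vstack([K_cache, k[None]])
    V_cache = np.vstack([V_cache, v[None]])
    outs.append(attend(q, K_cache, V_cache))
```

Efficient-attention variants (FlashAttention, linear attention) change how this softmax is computed, but the cache structure is the same.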
Benefits
- Zoox Stock Appreciation Rights (SARs) and Amazon RSUs
- Health, long-term care, disability, and life insurance
- Paid time off, vacation, and sick leave