AI Inference Engineer - Model Optimization & Deployment

Type
Full time

Related skills

Python, CUDA, TensorRT, TensorRT-LLM, PTQ

πŸ“‹ Description

  • Optimize large-scale models (LLMs/VLMs) with quantization (PTQ/QAT) and LoRA/QLoRA.
  • Architect model conversion/compilation pipelines using TensorRT for edge deployment (a minimal build sketch follows this list).
  • Validate output parity, drive accuracy recovery, and benchmark latency against PyTorch baselines and edge binaries.
  • Write and optimize CUDA kernels and TensorRT Plugins for high memory bandwidth and low latency.
  • Produce production-grade, concurrent C++ and Python code for real-time edge inference.
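
As a rough illustration of the conversion step above, the sketch below compiles an ONNX export into a TensorRT engine with FP16 and INT8 flags enabled. The model path, precision flags, and the commented-out calibrator are placeholders, and exact builder calls vary by TensorRT version; this is a sketch, not the team's actual pipeline.

```python
# Hypothetical sketch: build a TensorRT engine from an ONNX export with
# mixed-precision flags (TensorRT 8.x-style Python API). Paths and the
# calibrator are illustrative placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str) -> None:
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 tactics
    config.set_flag(trt.BuilderFlag.INT8)  # PTQ path; needs a calibrator or Q/DQ nodes
    # config.int8_calibrator = MyEntropyCalibrator(...)  # placeholder PTQ calibrator

    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("model.onnx", "model.plan")
```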

🎯 Requirements

  • Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference (INT8, FP8, INT4, BF16/FP16).
  • Proven experience optimizing large-scale models (LLMs, VLMs, or VLAs) using KV-cache, Speculative Decoding, and Efficient Attention (FlashAttention, Linear Attention).
  • Extensive experience with model conversion/compilation pipelines (TensorRT, TensorRT-LLM) and benchmarking (a timing sketch follows this list).
  • Proficiency in low-level programming for AI accelerators: writing and optimizing custom CUDA kernels and TensorRT Plugins.
  • Production-level C++ (14/17/20) and Python programming for concurrent, memory-safe, real-time edge inference.
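
For the benchmarking requirement above, the minimal sketch below times a PyTorch module with CUDA events; the model, input shape, and warmup/iteration counts are assumptions for illustration, and the same harness shape applies when timing a deserialized TensorRT engine for comparison.

```python
# Hypothetical sketch: measure steady-state GPU latency of a PyTorch module
# with CUDA events. Model, input shape, and iteration counts are placeholders.
import torch

@torch.inference_mode()
def benchmark(model: torch.nn.Module, x: torch.Tensor,
              warmup: int = 20, iters: int = 100) -> float:
    model.eval().cuda()
    x = x.cuda()
    for _ in range(warmup):  # warm up allocator, autotuned kernels, caches
        model(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency in milliseconds

# Example: latency_ms = benchmark(my_model, torch.randn(1, 3, 224, 224))
```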

🎁 Benefits

  • Zoox Stock Appreciation Rights (SARs) and Amazon RSUs
  • Health, long-term care, disability, and life insurance
  • Paid time off, vacation, and sick leave