Related skills
DeepSpeed, ONNX Runtime, vLLM, TensorRT-LLM, LLM inference optimization
Description
- Design and fine-tune inference pipelines for LLMs to maximize throughput and minimize latency.
- Apply quantization, pruning, speculative decoding, batching, and kernel fusion.
- Optimize inference-serving stacks (vLLM, TensorRT-LLM, ONNX Runtime) for production.
- Profile and tune GPU/accelerator utilization across the full inference stack.
- Design and optimize distributed training pipelines for large-scale models.
- Tune training efficiency with mixed-precision, checkpointing, and optimizer improvements.
Requirements
- Deep expertise in LLM inference optimization, including serving frameworks (vLLM, TensorRT-LLM, ONNX Runtime).
- Strong background in distributed AI training with PyTorch FSDP, DeepSpeed, Megatron-LM, or JAX/XLA.
- Experience packaging AI environments for reproducible deployment (containers, Apptainer/Singularity).
- Fluency with GPU profiling tools: NVIDIA Nsight, PyTorch Profiler, CUDA analysis.
- Familiarity with HPC environments: Slurm, PBS, RDMA/InfiniBand, MPI.
- Experience integrating AI workloads into CI/CD pipelines with automated testing and benchmarking.
- Comfort using LLM-based tools and agentic frameworks; strong analytical and communication skills.
Benefits
- Medical, dental, and vision insurance.
- Flexible paid time off.
- Employee stock options.
- Remote work; no travel required for most positions.