AI Platform Engineer – Training & Inference

Added: 5 days ago
Type: Full time
Salary: Not provided

Related skills

PyTorch, TensorRT, Ray, GKE, NVIDIA Triton

📋 Description

  • Own Ray ecosystem end-to-end on GKE
  • Operate Ray Train on multi-node H100 clusters
  • Build LLM inference mesh with Ray Serve
  • Optimize inference: fractional GPUs, batching, autoscaling
  • Design model routing layer for multi-tenant LLMs
  • Build RL training infra with Flyte and RLlib

🎯 Requirements

  • ML engineering experience on an ML platform or in MLOps
  • Production Ray depth: Train, Serve, Core, Data
  • LLM serving engines: vLLM, SGLang, NVIDIA Triton
  • Distributed training: DDP, FSDP, NCCL, mixed precision BF16/FP8
  • RL knowledge: PPO, policy gradient, RLHF
  • Model lifecycle ops: MLflow registry; shadow, A/B, and canary deployments; automated rollback
  • Vector databases: pgvector or Qdrant
  • Python and PyTorch; Flyte or equivalent ML orchestrator

🎁 Benefits

  • Competitive total rewards package
  • Opportunities for growth and advancement