Related skills
docker, python, kubernetes, go, deepspeed

Description
- Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
- Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency.
- Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed training frameworks, integrating them into a seamless platform SDK that handles configuration, experiment tracking, and model lifecycle management.
- Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability. You will build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns to ensure the quality and safety of models before they hit production.
- Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads, optimizing for throughput and dynamic batching.
- Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency. Mentor senior engineers, champion a rigorous MLOps culture, and partner with cross-functional leadership to define technical roadmaps and de-risk major initiatives.
Requirements
- 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
- GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).
- Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
- Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards.
- GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all within Kubernetes.
- Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits