Related skills
docker, python, kubernetes, go, deepspeed

Description
- Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
- Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency.
- Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed training frameworks, integrating them into a seamless platform SDK that handles configuration, experiment tracking, and model lifecycle management.
- Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability. You will build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns to ensure the quality and safety of models before they hit production.
- Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads, optimizing for throughput and dynamic batching.
- Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency. Mentor senior engineers, champion a rigorous MLOps culture, and partner with cross-functional leadership to define technical roadmaps and de-risk major initiatives.
Requirements
- 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
- GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).
- Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
- Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies. Experience with tools like Ray, MLflow, or similar ecosystem standards.
- GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all within Kubernetes.
- Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits