Related skills
machine learning, LLM, PPO, DPO, SFT
Description
- Design and run post-training pipelines (SFT, GRPO, DPO, RLVR)
- Build task-specific training environments and evals for healthcare, code, and legal domains
- Translate production data into training signals; design reward loops
- Run end-to-end training experiments; diagnose reward hacking and drift
- Publish findings and contribute to Baseten's open-source training libraries
Requirements
- Hands-on LLM training with reinforcement learning (GRPO/PPO)
- Strong reward engineering intuition; ability to distinguish effective rewards from exploitable ones
- Experience building multi-turn agent environments with tool use
- Comfort with the end-to-end ML pipeline, from data to deployment
- Experience with production ML systems, preferably ones that use closed-loop production data
- Experience with RL training frameworks
- Publications at NeurIPS/ICML/ICLR on RL for LLMs, reward modeling, or alignment
Benefits
- Competitive pay with meaningful equity
- 100% medical, dental, and vision for you and dependents
- Generous PTO including Winter Break
- Paid parental leave
- Company-facilitated 401(k)
- Exposure to ML startups for learning and networking