Related skills
prompt engineering, model performance, transcripts, claude code, agentic evals
📋 Description
- Own Claude Code launch planning and execution with cross-team coordination
- Design and implement agentic evals that measure real-world coding performance
- Drive the engineering team's eval roadmap
- Partner with researchers to define target behaviors and influence model development with evidence from real usage
- Talk with users and analyze transcripts to understand capability gaps and ship improvements
- Synthesize signals from users and benchmarks into clear priorities
🎯 Requirements
- Have personally built agentic evals (e.g. SWE-bench-style task suites)
- Are a Claude Code user and can articulate desired model behaviors
- Have an engineering background and 2+ years in product management, or equivalent experience
- Have a deep grasp of AI concepts and are comfortable with model behavior, prompt engineering, and evaluation methodology
- Are a systems thinker: you build infrastructure that prevents broad problems
- Hold a bachelor’s degree or have equivalent education, training, and experience
🎁 Benefits
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours and a collaborative office environment
- HQ in San Francisco with a supportive office space
🛃 Visa sponsorship