Related skills
SQL · Python · Statistics · Observability · Experimentation

📋 Description
- Design evaluation frameworks for real-world task success in agentic systems
- Build benchmarking pipelines for trust calibration and handoff quality
- Lead observability tooling for analyzing production agent behavior
- Translate signals into actionable insights for model iteration
- Collaborate with researchers, engineers, and product to align evaluation goals
- Own scalable benchmarking infrastructure across AI initiatives
🎯 Requirements
- Experience designing evaluation systems for agentic or LLM-based AI
- Expertise in statistical experiments and benchmark creation
- Fluency with Python, SQL, and distributed data processing
- Experience building data pipelines and tooling for analysis and observability
- Ability to influence product and model roadmaps with evaluation insights
🎁 Benefits
- Annual base salary: $195,000 – $308,000 USD
- Eligible for annual bonus or incentive plan
- Comprehensive medical coverage for you and your family
- Unlimited PTO
- 401(k) with matching
- 12 weeks paid parental leave