Related skills: Python, evaluation, LLM, agents, datasets

Description
- Design and run evaluations of new AI capabilities
- Compare frontier models, agent systems, and tool workflows
- Turn emerging ideas into measurable benchmarks
- Define datasets, tasks, and scoring logic for experiments
- Design realistic workloads that reflect production environments
- Create tests that expose failure modes and edge cases
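The evaluation workflow described above (run models over a dataset of tasks, score the outputs, aggregate the results) can be sketched in a few lines of Python. The model function, dataset shape, and exact-match scorer here are illustrative assumptions, not a specific stack or API:

```python
def exact_match(prediction: str, expected: str) -> float:
    """Score 1.0 if the prediction matches the expected answer exactly."""
    return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0

def run_eval(model_fn, dataset, scorer=exact_match):
    """Run model_fn over each example and return per-example results plus accuracy."""
    results = []
    for example in dataset:
        prediction = model_fn(example["prompt"])
        results.append({
            "prompt": example["prompt"],
            "prediction": prediction,
            "score": scorer(prediction, example["expected"]),
        })
    accuracy = sum(r["score"] for r in results) / len(results)
    return results, accuracy

# Toy usage: a stand-in "model" with canned answers, not a real API call.
dataset = [
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]
toy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
results, accuracy = run_eval(toy_model, dataset)
print(accuracy)  # 1.0
```

In practice `model_fn` would wrap a model or agent API call, and `scorer` would be swapped for task-appropriate logic (rubric grading, fuzzy match, execution checks), but the dataset/task/scorer separation is the core structure.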
Requirements
- Built or contributed to evaluation systems for LLM or agent applications
- Designed experiments comparing models, prompts, or AI architectures
- Written Python code to run tests across models or APIs
- Built datasets or scoring logic for AI quality measurement
- Investigated model failures or unexpected behaviors
- Published technical blog posts, research notes, or engineering write-ups
Benefits
- Medical, dental, and vision insurance
- Daily lunch, snacks, and beverages
- Flexible time off
- Competitive salary and equity
- AI Stipend