Related skills
Python, TypeScript, evaluation, datasets, benchmarks

Description
- Design evaluation frameworks to measure AI accuracy, reliability, and edge cases.
- Create and curate high-quality datasets, golden tests, and benchmarks from production data.
- Build automated test harnesses and metrics pipelines to evaluate models and prompts.
- Partner with applied AI engineers and product leaders to define measurable criteria.
- Own the evaluation lifecycle for major AI initiatives from experimentation to production.
Requirements
- 5+ years of professional experience in CS, ML, or a related field.
- Experience building testing, evaluation, or data infrastructure for AI/ML systems.
- Comfort writing production-quality code in Python and TypeScript.
- Experience with structured and unstructured data, labeling workflows, or data pipelines.
- Familiarity with ML evaluation techniques (offline/online metrics, regression testing).
- Bonus: experience evaluating LLMs, agentic systems, or AI-assisted developer tools.
Benefits
- Base salary: $240k–$280k USD per year.
- Equity included.
- Eligible for benefits and health insurance.
- See the Sentry Benefits page for details.