Staff Developer – AI Evaluation & Reliability

Added: 19 days ago
Type: Full time
Salary: Not provided

Related skills

AWS, SQL, Python, LLMs, LangChain

📋 Description

  • Own and evolve the evaluation strategy for LLM- and agent-based systems: golden datasets, A/B tests (first sketch below).
  • Benchmark foundation model performance within Caseware’s domain; identify gaps and risks.
  • Lead RAG pipeline design: embeddings, retrieval strategies, reranking, quality metrics.
  • Design feedback/evaluation pipelines linking user behavior to improvements.
  • Define guardrails for agentic systems: schema validation, content filtering, tool governance (second sketch below).
  • Establish approval gates and rollout controls: feature flags, staged deployments, kill switches (third sketch below).
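
First, a minimal sketch of a golden-dataset evaluation harness. The `GoldenExample` type, the exact-match scorer, and the stubbed model call are illustrative assumptions, not anything specified in this posting; a real harness would plug in an LLM call and richer scorers.

```python
# Minimal golden-dataset evaluation loop (illustrative sketch).
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    prompt: str
    expected: str  # reference answer curated by a domain expert

def exact_match(candidate: str, expected: str) -> float:
    """Crude scorer: 1.0 on a normalized exact match, else 0.0."""
    return float(candidate.strip().lower() == expected.strip().lower())

def evaluate(generate_answer: Callable[[str], str],
             golden_set: list[GoldenExample]) -> float:
    """Run the model over the golden set and return the mean score."""
    scores = [exact_match(generate_answer(ex.prompt), ex.expected)
              for ex in golden_set]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    golden = [GoldenExample("What is 2 + 2?", "4")]
    # Stub model for illustration; a real run would call the LLM here.
    print(evaluate(lambda prompt: "4", golden))  # -> 1.0
```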
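Second, a guardrail sketch using Pydantic (v2) to validate an agent's structured output before a downstream tool executes it. The `ToolCall` schema and the return-None-on-failure policy are hypothetical choices for illustration.

```python
# Schema-validation guardrail for agent tool calls (illustrative sketch).
from pydantic import BaseModel, Field, ValidationError

class ToolCall(BaseModel):
    tool_name: str = Field(pattern=r"^[a-z_]+$")  # allow-listed naming style
    argument: str = Field(max_length=500)         # cap payload size

def validate_llm_output(raw_json: str) -> ToolCall | None:
    """Reject malformed or out-of-policy output instead of executing it."""
    try:
        return ToolCall.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller can retry, fall back, or escalate to a human

print(validate_llm_output('{"tool_name": "search", "argument": "IFRS 16"}'))
print(validate_llm_output('not json at all'))  # -> None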
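Third, a kill-switch sketch: gating the agentic path behind a runtime flag so it can be disabled instantly, without a redeploy. The `AGENT_ENABLED` variable name and the fallback path are made-up examples, not a known Caseware mechanism.

```python
# Kill switch via a runtime flag (illustrative sketch).
import os

def agent_enabled() -> bool:
    """Read the flag on every request so a flip takes effect immediately."""
    return os.environ.get("AGENT_ENABLED", "false").lower() == "true"

def handle_request(query: str) -> str:
    if agent_enabled():
        return f"[agent path] {query}"   # would invoke the LLM agent
    return f"[fallback path] {query}"    # safe deterministic fallback

print(handle_request("summarize Q3 engagement"))
```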

🎯 Requirements

  • Strong data science foundation with Python, SQL, statistics, and experiment design.
  • Deep hands-on experience with LLMs, prompting strategies, and agent reasoning patterns.
  • Practical expertise with embeddings, vector databases, retrieval metrics, and reranking approaches (a recall@k sketch follows this list).
  • Proven experience designing or operating evaluation frameworks for generative AI or agentic systems (automated and human-in-the-loop).
  • Strong understanding of AI reliability, safety, and governance (guardrails, validation, monitoring, change control).
  • Nice-to-have: experience with LangChain or similar orchestration frameworks.
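
As an example of the retrieval metrics named above, here is a minimal recall@k implementation; the document IDs in the demo run are made up.

```python
# recall@k: fraction of relevant documents found in the top-k results.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))  # -> 0.5
```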

🎁 Benefits

  • 100% remote work and strong work-life balance.
  • Competitive compensation with above-market benefits.
  • Prepaid medical plan.
  • Life insurance and funeral assistance.
  • Internet allowance and home office stipend.
  • Mentorship and budget for training.