Related skills
aws sql python llms langchain
📋 Description
- Own and evolve evaluation strategy for LLM- and agent-based systems (golden data, A/B tests).
- Benchmark foundation model performance within Caseware’s domain; identify gaps and risks.
- Lead RAG pipeline design: embeddings, retrieval strategies, reranking, quality metrics.
- Design feedback/evaluation pipelines linking user behavior to improvements.
- Define guardrails for agentic systems: schema validation, content filtering, tool governance.
- Establish approval gates and rollout controls: feature flags, staged deployments, kill switches.
🎯 Requirements
- Strong data science foundation with Python, SQL, statistics, and experiment design.
- Deep hands-on experience with LLMs, prompting strategies, and agent reasoning patterns.
- Practical expertise with embeddings, vector databases, retrieval metrics, and reranking approaches.
- Proven experience designing or operating evaluation frameworks for generative AI or agentic systems (automated and human-in-the-loop).
- Strong understanding of AI reliability, safety, and governance (guardrails, validation, monitoring, change control).
- Nice to have: experience with LangChain or a similar framework.
🎁 Benefits
- 100% remote work and strong work-life balance.
- Competitive compensation with above-market benefits.
- Prepaid medical plan.
- Life insurance and funeral assistance.
- Internet allowance and home office stipend.
- Mentorship and budget for training.