Related skills
SQL, QA, prompt engineering, Playwright, RAG
Description
- Design and test strategies for LLM features, prompts, and hallucination detection
- Build automated evaluation pipelines (eval sets, golden data, LLM-as-judge) to catch quality regressions
- Black-box and exploratory testing of AI features on web/mobile for accuracy, safety, edge cases
- Define AI quality metrics and release readiness thresholds
- Collaborate with engineers, PMs, ML/AI engineers, and clinicians to define what good AI looks like
- Investigate and triage AI failure modes: model, prompt, retrieval, integration issues
Requirements
- 5+ years software QA with at least 1 year testing LLM/AI features
- Strong QA principles, test case creation/documentation for deterministic and non-deterministic systems
- Hands-on LLM tooling: prompt engineering, RAG, eval frameworks, LLM APIs (OpenAI, Anthropic)
- Experience designing automated qualitative evaluation (LLM-as-judge, rubric scoring, golden datasets)
- Proficiency with test automation tools, emphasis on Playwright
- Strong SQL for data validation and test data creation
Benefits
- Medical, dental, and vision coverage, with the option to extend to your dependents
- Company-sponsored short-term insurance
- Fully paid 8-week parental leave after 6 months of employment
- Company-sponsored 401k after 3 months of employment
- Unlimited vacation for salaried roles
- Bi-annual company offsites and a work-from-home stipend