Own enterprise-scale evaluation infrastructure for AI agents.
Build a unified evaluation platform that serves as the single source of truth for agent workflows.
Develop observability to surface agent behavior and failures.
Integrate LLMs, tools, retrieval, and logic into reliable agent experiences.
Create automated pipelines to evaluate models within hours of release.

🎯 Requirements

Multiple years shipping production software in complex systems.
Proficiency in TypeScript, React, Python, and PostgreSQL.
Built and deployed LLM-powered features in production.
Designed evaluation frameworks for model outputs and agent behaviors.
Worked with vector databases, embeddings, and RAG architectures.
Experience with evaluation platforms (LangSmith, Langfuse, or similar).

🎁 Benefits

Competitive compensation with meaningful ownership.
Flexible PTO.
401(k).
Wellness benefits, including therapy sessions.
Technology and work-from-home reimbursement.
Flexible work schedules.
