Related skills
python git pytest evaluation llm
📋 Description
- Design coding benchmarks for frontier models on real-world programming tasks
- Build and maintain scalable evaluation data pipelines
- Analyze model-generated code for correctness and edge cases
- Create structured evaluation scenarios across large repos and multi-language environments
- Provide detailed feedback on model performance and failure patterns
- Contribute to evaluation frameworks for coding benchmarks
🎯 Requirements
- 4+ years of professional software engineering experience
- Expert Python — clean, performant, well-tested code
- Hands-on experience in large, complex codebases
- Proven experience designing LLM coding benchmarks and data pipelines
- Strong Git skills and familiarity with modern development workflows; strong written English
- Track record at a high-growth tech company or top-tier software org
🎁 Benefits
- Fully remote: work from anywhere on the accepted locations list
- Contract length: 3 months, with potential extension
- Hours vary; full-time availability preferred
- Engagement: 1099 independent contractor
- Weekly payment via PayPal or Stripe