G2i

⏰ 4-day week company

11-50 employees
17 jobs posted

Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Added

27 days ago

Location

🌍 North America

Type

Contract

Salary

Salary not provided

Related skills

python, git, pytest, evaluation, llm

📋 Description

Design coding benchmarks for frontier models on real-world programming tasks
Build and maintain scalable evaluation data pipelines
Analyze model-generated code for correctness and edge cases
Create structured evaluation scenarios across large repos and multi-language environments
Provide detailed feedback on model performance and failure patterns
Contribute to evaluation frameworks for coding benchmarks

🎯 Requirements

4+ years of professional software engineering experience
Expert-level Python: clean, performant, well-tested code
Hands-on experience in large, complex codebases
Proven experience designing LLM coding benchmarks and data pipelines
Strong Git skills and familiarity with modern development workflows; strong written English
Track record at a high-growth tech company or top-tier software org

🎁 Benefits

Fully remote: work from anywhere on the accepted locations list
Contract length: 3 months, with potential extension
Hours vary; full-time availability preferred
Engagement: 1099 independent contractor
Weekly payment via PayPal or Stripe
