Research Engineer, RL Infrastructure and Reliability (Knowledge Work)

Added
less than a minute ago
Type
Full time

Related skills

SRE, Python, distributed systems, observability, ML

📋 Description

  • Own reliability for Knowledge Work training environments.
  • Ensure evaluation tools and model release processes are clean and reproducible.
  • Build observability dashboards and tooling with high signal-to-noise metrics.
  • Proactively harden environments via load testing and fault injection.
  • Serve as the primary contact for partner teams and drive incidents to resolution.
  • Reduce researchers' operational burden so they can focus on research.

🎯 Requirements

  • Highly experienced Python engineer shipping reliable, instrumented code.
  • Experience operating ML or distributed systems at scale, including on-call duty.
  • SRE/production mindset: SLOs, load tests, and failure injection.
  • Foundational ML knowledge to understand training/evaluation.
  • Able to read research code and reason about evaluation integrity.
  • 5+ years operating ML or distributed systems at scale.

🎁 Benefits

  • Competitive compensation and benefits.
  • Optional equity donation matching.
  • Generous vacation and parental leave.
  • Flexible working hours.
  • Lovely office space in San Francisco.

🛃 Visa sponsorship
