Software Engineer, RL Training Infra

Type
Full time

Related skills

Linux, Python, Kubernetes, machine learning, distributed systems

📋 Description

  • Keep large-scale RL training runs moving by addressing urgent infra problems.
  • Debug issues across training systems, inference, orchestration, and distributed infra.
  • Solve hard problems at the research/engineering boundary: scale, reliability, latency, cost.
  • Improve reliability and efficiency of RL training runs.
  • Help researchers with infra-heavy integrations like multi-agent memory.
  • Turn recurring operational issues into tools, systems, or abstractions.

🎯 Requirements

  • Strong generalist engineer with ML infrastructure experience.
  • Experience with RL, inference, scaling, or ML infrastructure.
  • Learns quickly and operates across unfamiliar layers.
  • Strong debugger with ownership, low ego, and good communication.
  • Comfortable in messy, high-ownership environments with tight timelines.
  • Thrives in fast-moving settings that prioritize both reliability and speed.

🎁 Benefits

  • Experience supporting large-scale model training, async RL, or high-throughput ML infrastructure.
  • Experience debugging distributed systems across GPUs, networking, orchestration.
  • Background in performance optimization, scaling, or production-critical infra.
  • Experience working directly with researchers or fast-moving model teams.