Location
Type
Full time
Related skills
linux, python, kubernetes, machine learning, distributed systems
Description
- Keep large-scale RL training runs moving by addressing urgent infra problems.
- Debug issues across training systems, inference, orchestration, and distributed infra.
- Solve hard problems at the research/engineering boundary: scale, reliability, latency, cost.
- Improve reliability and efficiency of RL training runs.
- Help researchers with infra-heavy integrations like multi-agent memory.
- Turn recurring operational issues into tools, systems, or abstractions.
Requirements
- Strong generalist engineer with ML infrastructure experience.
- Experience with RL, inference, scaling, or ML infrastructure.
- Learns quickly and operates across unfamiliar layers.
- Strong debugger with ownership, low ego, and good communication.
- Comfortable in messy, high-ownership environments with tight timelines.
- Thrives in fast-moving settings that prioritize reliability and speed.
Benefits
- Experience supporting large-scale model training, async RL, or high-throughput ML infrastructure.
- Experience debugging distributed systems across GPUs, networking, and orchestration.
- Background in performance optimization, scaling, or production-critical infra.
- Experience working directly with researchers or fast-moving model teams.