Type
Full time
Related skills
sre, python, distributed systems, observability, ml
Description
- Own reliability for Knowledge Work training environments.
- Ensure evaluation tools and model release processes are clean and reproducible.
- Build observability dashboards and tooling with high signal-to-noise metrics.
- Proactively harden environments via load testing and fault injection.
- Be the primary contact for partner teams and drive incidents to resolution.
- Reduce researchers' operational burden to let them focus on research.
Requirements
- Highly experienced Python engineer shipping reliable, instrumented code.
- Experience operating ML or distributed systems at scale (on-call).
- SRE/production mindset: SLOs, load tests, and failure injection.
- Foundational ML knowledge to understand training/evaluation.
- Able to read research code and reason about evaluation integrity.
- 5+ years operating ML or distributed systems at scale.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Lovely office space in San Francisco.
Visa sponsorship