Research Infrastructure Engineer, Training Systems

Added
18 minutes ago
Type
Full time

Related skills

networking python pytorch distributed systems apis

📋 Description

  • Build and maintain infrastructure for large-scale model training and experimentation.
  • Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
  • Improve reliability, debuggability, and performance across training and data pipelines.
  • Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
  • Write tests, benchmarks, and diagnostics that catch meaningful regressions.

🎯 Requirements

  • Strong systems instincts with focus on performance, reliability, and clean abstractions.
  • Comfortable working across ML research code and production infrastructure.
  • Good taste in API and interface design with empathy for researchers.
  • Able to debug across Python, PyTorch, distributed systems, GPUs, and networking.
  • Able to write tests, benchmarks, and diagnostics that catch regressions.
  • Proficient in Python and PyTorch for ML workflows.