Technical Lead Manager - Training Runtime, Data(set) Movement

Added
8 minutes ago
Type
Full time

Related skills

rust python storage multimodal torch.utils.data

📋 Description

  • Design and build a unified dataset read platform for multiple training frameworks.
  • Define dataset APIs, storage expectations, versioning, and migration paths.
  • Build reliability into the read path: stateful iteration, caching, restart.
  • Build terminal and web visualizers to inspect data late in the pipeline.
  • Write and review production code in data loading, caching, and reliability paths.
  • Partner with teams across frameworks, RL, multimodal models, and storage.

🎯 Requirements

  • Built data loading, datasets, storage, or distributed infra at scale (e.g. torch.utils.data)
  • Focus on API design, debugging ergonomics, performance, and bit-level correctness
  • Understand failure modes of large distributed training jobs and data systems
  • Experience with stateful iterators, checkpoint/restart semantics, caching
  • Comfortable with Python and lower-level systems code; Rust or C++ helpful
  • Worked with multimodal, video, RL, or pretraining data pipelines

🎁 Benefits

  • Equity
  • Hybrid work environment
  • Impactful work on large-scale AI systems and research