Research Engineer, Infrastructure

Added: 5 hours ago
Type: Full-time
Salary: not provided

Related skills

Python, PyTorch, distributed systems, data pipelines, JAX

📋 Description

  • Distributed training infrastructure across GPU clusters
  • Experiment orchestration and tooling for launches and tracking
  • Data pipeline engineering for training and evaluation
  • Debugging and reliability across GPUs, networking, and numerics
  • Parallelism and systems research: data, tensor, pipeline, and sequence parallelism
  • Scaling infrastructure ahead of research to prevent bottlenecks

🎯 Requirements

  • Deep experience building distributed training systems for large models
  • Strong systems engineering across distributed systems, networking, and storage
  • Proficiency in Python and C++; systems-level experience with PyTorch, JAX, or an equivalent framework
  • Hands-on experience with GPU profiling, memory optimization, and compute efficiency
  • Experience implementing parallelism strategies: data, tensor, pipeline, and sequence
  • PhD in CS/ML/Physics/Math or equivalent industry experience

🎁 Benefits

  • Small, selective team; prototypes deployed quickly
  • Own infrastructure across thousands of GPUs; compute is not a constraint
  • Environment rewards speed, autonomy, and technical depth