Related skills
python · pytorch · distributed systems · data pipelines · jax

📋 Description
- Distributed training infrastructure across GPU clusters
- Experiment orchestration and tooling for launches and tracking
- Data pipeline engineering for training and evaluation
- Debugging and reliability across GPUs, networking, numerics
- Parallelism and systems research: data, tensor, pipeline, sequence
- Scaling infrastructure ahead of research to prevent bottlenecks
🎯 Requirements
- Deep experience building distributed training systems for large models
- Strong systems engineering across distributed systems, networking, storage
- Proficiency in Python and C++; PyTorch/JAX or equivalent at systems level
- Hands-on GPU profiling, memory optimization, compute efficiency
- Experience implementing parallelism strategies: data, tensor, pipeline, sequence
- PhD in CS/ML/Physics/Math or equivalent industry experience
🎁 Benefits
- Small, selective team; prototypes deployed quickly
- Own infrastructure across thousands of GPUs; compute not a constraint
- Environment rewards speed, autonomy, and technical depth
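The parallelism strategies named above (data, tensor, pipeline, sequence) can be illustrated minimally. The sketch below is a hypothetical single-process NumPy stand-in for tensor (column) parallelism, not code from this team: a linear layer's weight matrix is split column-wise across imagined shards, each shard computes its output slice, and concatenation plays the role of the all-gather in a real multi-GPU setup.

```python
import numpy as np

# Single-process sketch of tensor (column) parallelism.
# Assumption: shards stand in for devices; concatenation stands in
# for the cross-device all-gather.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of activations
w = rng.standard_normal((8, 16))   # full weight matrix of a linear layer

num_shards = 4
shards = np.split(w, num_shards, axis=1)       # column-wise weight split
partials = [x @ s for s in shards]             # each "device" computes a slice
y_parallel = np.concatenate(partials, axis=1)  # all-gather along columns

y_full = x @ w                                 # single-device reference
assert np.allclose(y_parallel, y_full)
```

The same decomposition idea underlies the other listed strategies: data parallelism shards the batch dimension, pipeline parallelism shards layers, and sequence parallelism shards the sequence dimension.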