Related skills
TensorFlow, PyTorch, Transformers, JAX, data parallelism

Description
- Build in-house tooling to support post-training models
- Work across the stack: Kubernetes, storage, networking
- Leverage PyTorch distributed tensor computation and GPU kernels
- Train a wide spectrum of model architectures at scale
- Collaborate with researchers to define specs and requirements
- Address systems-level ML infra and tooling challenges
Requirements
- Deep understanding of modern ML techniques for training transformers
- Advanced experience with PyTorch, TensorFlow, or JAX
- Knowledge of transformer training parallelism strategies: data, tensor, and pipeline parallelism
- Ability to profile and optimize distributed GPU programs
- Familiarity with HPC and distributed platforms: Slurm, Ray, Kubernetes, Dask
- Familiarity with cluster networking: InfiniBand, RoCE, GPUDirect
Benefits
- Competitive compensation, including meaningful equity
- 100% medical/dental/vision coverage for employees and dependents
- Generous PTO including Winter Break
- Paid parental leave
- Company-facilitated 401(k)
- Exposure to a variety of ML startups and learning opportunities