Related skills
python, kubernetes, pytorch, jax, gpu

Description
- Build and scale ML/HPC infrastructure: Kubernetes GPU/TPU clusters across clouds.
- Optimize AI/ML training: cost, reliability, performance; RDMA/NCCL/interconnects.
- Troubleshoot bottlenecks and failures to minimize disruption.
- Enable researchers with self-service tools to monitor, debug, and optimize training jobs.
- Drive innovation in ML infrastructure with JAX, PyTorch, and distributed training.
- On-call rotation (24x7) with compensation.
Requirements
- Deep ML/HPC infra expertise: GPU/TPU clusters and distributed training.
- Kubernetes at scale: deploy/manage cloud-native clusters for AI workloads.
- Strong programming: Python for ML tooling and Go for systems; open-source contributions favored.
- Linux internals, RDMA networking, HPC performance tuning.
- Research collaboration experience with AI researchers/ML engineers.
- Self-directed problem solving and driving impact in fast-paced environments.
Benefits
- Open and inclusive culture and work environment.
- Work with cutting-edge AI research.
- Weekly lunch stipend, in-office lunches and snacks.
- Health and dental benefits, mental health budget.
- Parental leave top-up for up to 6 months.
- Remote-flexible with offices in multiple cities and coworking stipend; 6 weeks vacation.