Related skills
cloud, kubernetes, distributed systems, job scheduling, hpc
📋 Description
- Design and run research infrastructure for large-scale GPU experiments.
- Build services to schedule, orchestrate, and observe workloads.
- Improve tooling and monitoring to boost researcher productivity.
- Influence roadmaps for research compute, training, and delivery.
- Mentor engineers on compute, infra, and AI systems.
- Partner with researchers, ML engineers, and platform teams.
🎯 Requirements
- BS/MS or PhD in Computer Science or related field.
- 5+ years in software engineering with large-scale distributed systems.
- Deep experience building/operating distributed systems and data pipelines.
- Proficient in one or more systems languages (C++, Rust, Go, Java, Scala).
- Experience with cluster schedulers or large-scale job orchestration (Kubernetes, Slurm, Ray).
- Understanding of ML training and inference workflows (distributed training, evaluation).
🎁 Benefits
- Comprehensive benefits; region-specific details are available online.
- Inclusive culture with a commitment to diversity.
- Work on cutting-edge AI research infrastructure.
- Global teams and offices, including a presence in SF.