Staff Software Engineer, GPU Infrastructure (HPC)

Added
4 days ago
Type
Full time
Salary
Salary not provided

Related skills

python kubernetes pytorch jax gpu

๐Ÿ“‹ Description

  • Build and scale ML HPC infra: Kubernetes GPU/TPU clusters across clouds.
  • Optimize AI/ML training: cost, reliability, performance; RDMA/NCCL/interconnects.
  • Troubleshoot bottlenecks and failures to minimize disruption.
  • Enable researchers with self-service tools to monitor, debug, and optimize training jobs.
  • Drive innovation in ML infrastructure with JAX, PyTorch, and distributed training.
  • On-call rotation (24x7) with compensation.

๐ŸŽฏ Requirements

  • Deep ML/HPC infra expertise: GPU/TPU clusters and distributed training.
  • Kubernetes at scale: deploy/manage cloud-native clusters for AI workloads.
  • Strong programming: Python for ML tooling and Go for systems; open-source favored.
  • Linux internals, RDMA networking, HPC performance tuning.
  • Research collaboration experience with AI researchers/ML engineers.
  • Self-directed problem solving and driving impact in fast-paced environments.

๐ŸŽ Benefits

  • Open and inclusive culture and work environment.
  • Work with cutting-edge AI research.
  • Weekly lunch stipend, in-office lunches and snacks.
  • Health and dental benefits, mental health budget.
  • Parental leave top-up up to 6 months.
  • Remote-flexible with offices in multiple cities and coworking stipend; 6 weeks vacation.
Share job

Meet JobCopilot: Your Personal AI Job Hunter

Automatically Apply to Engineering Jobs. Just set your preferences and Job Copilot will do the rest โ€” finding, filtering, and applying while you focus on what matters.

Related Engineering Jobs

See more Engineering jobs โ†’