Architect a multi-tenant orchestration layer for GPU clusters that sustains high utilization.
Design and implement scheduling primitives to optimize training job lifecycles.
Develop observability and automated health checks to proactively identify hardware issues.
Evaluate and integrate CNCF and AI ecosystem technologies (e.g., Ray, Kueue) to inform data-driven build-vs-buy decisions.
Collaborate with Finance and Procurement to drive capacity planning.
Participate in the on-call rotation to ensure service availability.

5+ years in backend or infra engineering, with 2+ years on ML workloads at scale.
Strong programming skills in Python, Go, Rust, or C++.
Experience with compute management systems covering queueing, quotas, preemption, and gang scheduling.
Experience with distributed training infra (EFA, InfiniBand) and topology-aware scheduling.
Experience with distributed storage (Lustre, S3) as it affects training throughput.
Kubernetes internals expertise (CRDs, Operators, Admission Controllers) and device plugins.

AI Infrastructure Engineer - Training Platform
