Related skills
cloud, kubernetes, distributed systems, go, observability
Description
- Design and build scalable, multi-tenant AI inference services.
- Develop and operate high-scale distributed systems with reliability.
- Improve observability, capacity management, automation, and tooling.
- Collaborate with platform, GPU infra, and product teams to deliver APIs.
- Raise the bar for software design and incident management practices.
- Shape traffic management, service orchestration, and platform scalability.
Requirements
- 5+ years building and operating multi-tenant platforms or distributed backends.
- Strong experience operating high-scale distributed services in production.
- Deep SRE experience: observability, incident management, reliability, and capacity planning.
- 1+ years of hands-on Go (Golang) experience in production systems.
- 1+ years of Kubernetes experience.
- Cloud-native architectures, microservices, and distributed systems fundamentals.
- Experience debugging performance, scalability, and reliability in production.
- Observability: familiarity with inference metrics such as time to first token (TTFT), time per output token (TPOT), and GPU utilization.
- AI/ML: knowledge of LLM serving with vLLM or Triton.
- Experience with API gateways, traffic routing, or service meshes.
- Familiarity with LLM stacks such as vLLM or TensorRT-LLM.
- Experience with inference optimization, rate limiting, routing, or workload orchestration.
Benefits
- We innovate with purpose and value ownership.
- Career development: conferences, training, and LinkedIn Learning.
- We care about well-being with global benefits.
- Salary, bonuses, and equity opportunities.
- DigitalOcean is an equal-opportunity employer.