Site Reliability Engineer, Inference Infrastructure

Added
10 days ago
Type
Full time
Salary
Salary not provided

Related skills

golang linux aws kubernetes gcp

📋 Description

  • Build self-service systems that automate managing, deploying, and operating services.
  • Develop Kubernetes operators to support language model deployments.
  • Automate environment observability and resilience for developers.
  • Uphold defined SLOs by participating in the on-call rotation.
  • Collaborate with internal teams and influence the infrastructure roadmap.
  • Share knowledge and lead active review processes within the team.

🎯 Requirements

  • 5+ years of engineering experience running production infrastructure at scale.
  • Experience designing large, distributed systems with Kubernetes and GPU workloads.
  • Experience with Kubernetes development, production coding, and support.
  • Experience with GCP, Azure, AWS, and OCI; multi-cloud, on-prem, and hybrid serving.
  • Experience in Linux-based computing environments: design, deploy, troubleshoot.
  • Experience in Golang or C++ for high-performance servers.

🎁 Benefits

  • Open and inclusive culture and work environment.
  • Work with a team on the cutting edge of AI research.
  • Weekly meal stipend, in-office lunches, and snacks.
  • Full health and dental benefits, including a mental health budget.
  • 6 weeks of vacation (30 working days).
  • Remote-flexible with offices in multiple cities and coworking stipend.