Principal Site Reliability Engineer

Added: 13 days ago
Type: Full-time
Salary: Not provided

Related skills

AWS, Python, Kubernetes, Go, LangChain

📋 Description

  • Own and define the long-term reliability strategy and architecture.
  • Design planet-scale, highly resilient systems on AWS and Kubernetes.
  • Lead autonomous operations platforms powered by AI agents.
  • Architect and implement LLM-driven SRE systems using the OpenAI API.
  • Establish gold standards for SRE: SLOs, SLAs, and incident management.
  • Drive observability architecture at scale: metrics, logs, traces.

🎯 Requirements

  • 10+ years in SRE/Platform/Distributed Systems.
  • Designed and operated large-scale distributed systems.
  • Experience with AWS at scale and Kubernetes internals, with a focus on reliability.
  • Strong Python/Go programming for platforms/tools.
  • Led cross-functional technical initiatives.
  • Experience integrating LLMs into production (OpenAI API) and AI automation.

๐ŸŽ Benefits

  • Tremendous growth and learning opportunities.
  • Challenging, rewarding work with impact.
  • Welcoming, positive work environment.
  • Equal opportunity employer with an inclusive culture.
  • Work with an AI-native platform and leading brands.