Principal Site Reliability Engineer (AI-first SRE)

Related skills

terraform grafana prometheus python kubernetes

📋 Description

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infra governance and detect IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.

🎯 Requirements

  • 10+ years in software/systems engineering, with 5+ years in SRE.
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
  • Proficiency in Python or Go for automation and tooling.
  • Experience with observability stacks (Prometheus/Grafana/OpenTelemetry) and service meshes.
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
  • Strong communication and influencing skills: data over hierarchy.

๐ŸŽ Benefits

  • Access to cutting-edge technologies in a transformative environment.
  • Professional growth and leadership development pathways.
  • A chance to shape reliable, scalable systems with real impact.