Site Reliability Engineer

Added: 3 days ago
Type: Full time
Salary: Not provided

Related skills

Azure, AWS, Grafana, Prometheus, Python

📋 Description

  • Improve reliability, scalability, performance, and observability for JFrog SaaS.
  • Define SLOs/SLIs, analyze failures, and support capacity planning.
  • Support day-to-day ops of multi-cloud, Kubernetes-based SaaS.
  • Build and enhance internal services/tools to reduce toil via automation.
  • Develop Python/Go automation to improve deployment safety and incident visibility.
  • Run PoCs and drive agentic automation using an ADK/agent framework.

🎯 Requirements

  • 4+ years in SRE, DevOps, or production engineering.
  • Kubernetes (Docker) and at least one cloud provider (AWS, GCP, or Azure).
  • SRE Fundamentals: SLO/SLI, alerting, incident response, postmortems.
  • Development: Python or Go for automation and internal tools.
  • Observability: metrics/logs/traces with Prometheus, Grafana, OpenTelemetry.
  • Incident & Resilience: strong incident response; DR readiness.
  • CI/CD: Jenkins, ArgoCD, or equivalent.
  • Soft Skills: documentation and collaborative problem solving.

🎁 Benefits

  • Hybrid work model with in-office days in Bangalore.
  • Opportunity to work on a global SaaS platform.
  • Collaborative, impact-focused team culture.