Lead Site Reliability Engineer

Added: less than a minute ago
Type: Full time
Salary: not provided

Related skills

Terraform, Prometheus, Kubernetes, Apache Kafka, Node.js

📋 Description

  • Define and drive the automation strategy for infrastructure tooling; set standards that minimize manual work and incidents
  • Own the design, reliability and evolution of core platform apps; mentor team members
  • Architect and lead the logging platform strategy; balance availability, retention and cost
  • Establish capacity planning and performance management frameworks; guide complex troubleshooting
  • Lead cross-functional reliability initiatives with SRE and service engineering teams
  • Demonstrate autonomy in identifying systemic weaknesses and platform improvements

🎯 Requirements

  • Proven track record in Site Reliability or Software Engineering, building scalable, reliable services
  • Deep expertise in distributed systems such as Pulsar, Kafka, Loki, and ScyllaDB/Cassandra
  • Experience designing and owning automation strategies, performance analysis frameworks, and diagnostics
  • 6+ years of hands-on experience in SRE or software engineering; multi-cloud experience with AWS/GCP
  • Experience with SLO-based observability platforms: Prometheus/Thanos and Grafana/Loki/Tempo
  • On-call excellence: monitoring and alerting improvements, runbooks, and week-long on-call rotations

🎁 Benefits

  • Collaborative, inclusive culture; commitment to diversity
  • Opportunity to work on a multi-cloud, multi-region platform
  • Grow with a leading martech company serving global brands
  • Cross-team mentorship and technical leadership