Related skills
Terraform, Prometheus, Kubernetes, Apache Kafka, Node.js
Description
- Define and drive the automation strategy for infrastructure tooling, setting standards that minimize manual work and incidents
- Own the design, reliability and evolution of core platform apps; mentor team members
- Architect and lead the logging platform strategy; balance availability, retention and cost
- Establish capacity planning and performance management frameworks; guide complex troubleshooting
- Lead cross-functional reliability initiatives with SRE and service engineering teams
- Demonstrate autonomy in identifying systemic weaknesses and platform improvements
Requirements
- Proven track record in Site Reliability or Software Engineering, building scalable, reliable services
- Deep expertise in distributed systems such as Pulsar, Kafka, Loki, and ScyllaDB/Cassandra
- Experience designing and owning automation strategies, performance analysis frameworks, and diagnostics
- 6+ years of hands-on experience in SRE or software engineering, including multi-cloud environments (AWS/GCP)
- Experience with observability platforms and SLOs, including Prometheus/Thanos and Grafana/Loki/Tempo
- On-call excellence: improving monitoring and alerting, maintaining runbooks, and taking part in week-long on-call rotations
Benefits
- Collaborative, inclusive culture; commitment to diversity
- Opportunity to work on a multi-cloud, multi-region platform
- Grow with a leading martech company serving global brands
- Cross-team mentorship and technical leadership