Related skills
docker ansible terraform linux prometheus
Description
- Design, implement, and operate scalable production services.
- Build alerting pipelines, dashboards, and SLO monitoring.
- Lead end-to-end incident response and post-mortems.
- Develop and extend IaC coverage; build internal tooling.
- Mentor SRE I/II through code reviews and knowledge sharing.
- Apply LLM-driven log analysis and AI tooling to incidents.
Requirements
- 6+ years in SRE or Platform Engineering, including work on reliability programs
- Kubernetes and Docker in production
- Advanced Linux troubleshooting: kernel internals, TCP/IP, DNS, load balancers
- Python for automation and Bash scripting
- Ansible with Terraform or Pulumi; mastery of Icinga, Prometheus, and Grafana
- Understanding of LLMs, embeddings, and ML pipelines for AI-ops
Benefits
- Competitive health benefits
- Retirement savings options (e.g., 401k)
- Equity grants and employee stock purchase plan
- Paid time off and parental leave
- Diversity and inclusion programs