Type
Full time
Related skills
azure aws grafana prometheus python
Description
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments.
- Drive automation by writing Python/Go to remove toil and build self-healing systems.
- Improve observability with Prometheus, Grafana, OpenTelemetry; define SLIs/SLOs and error budgets.
- Serve as lead Incident Commander while on call; develop response playbooks and post-incident analyses.
- Partner with Engineering for operability reviews and system maturity improvements.
- Work arrangement: hybrid with 3 days in San Jose, CA, or remote.
Requirements
- 8+ years of experience with the reliability and scalability of large-scale production services.
- Deep programming expertise: Python, Go, or C/C++.
- Strong knowledge of networking, Linux/FreeBSD, and distributed architectures.
- Experience in high-stakes incident management and 24/7 on-call rotation.
- Experience using ITIL workflows and incident data to drive service maturity.
- Extensive cloud experience with AWS, Azure, and GCP, plus infrastructure as code (IaC) using Ansible and Terraform.
Benefits
- Various health plans
- Time off plans for vacation and sick time
- Parental leave options
- Retirement options
- Education reimbursement
- In-office perks, and more!