Related skills
datadog terraform aws grafana prometheus
Description
- Build self-service platform infrastructure for product teams.
- Automate repetitive manual tasks to reduce toil.
- Implement monitoring, alerts, and dashboards for reliability.
- Participate in the on-call rotation and respond to incidents.
- Plan capacity and optimize performance for scale.
- Collaborate with security, product engineering, and SRE teams.
Requirements
- 5+ years running distributed systems and microservices in production.
- Strong AWS experience (EC2, ECS/EKS, VPC, IAM), including multi-AZ deployments.
- Terraform or CloudFormation fluency; think in code, not clickops.
- Go or Python for automation tooling.
- Production experience running multi-tenant Kubernetes clusters.
- Observability with Prometheus, Grafana, Datadog.
- Incident response experience: on-call, outages, postmortems.
- Security-minded approach: least-privilege, encryption, threat models.
Benefits
- Autonomy and ownership over architectural decisions.
- Modern stack and tooling.
- Sustainable on-call with fair rotation.
- Collaborative culture with design reviews and knowledge sharing.
- Remote-first with async communication.
- Travel for in-person engagement may be required.