Related skills
docker, terraform, aws, grafana, docker compose

Description
- Define and track SLIs/SLOs for services in the customer environment.
- Use error budgets to drive reliability conversations with the Arlington team.
- Eliminate toil by automating repetitive tasks in the secure enclave.
- Conduct post-incident reviews and root-cause analysis with the team.
- Own on-site observability: dashboards, alerts, and logs built on the LGTM stack.
- Lead on-site incident response: triage, containment, and customer comms.
Requirements
- 5+ years in SRE, production operations, or a related infrastructure role.
- Proven experience defining/tracking SLIs, SLOs, and error budgets.
- Hands-on Docker, Docker Compose, and AWS in production.
- Linux/Unix administration in constrained environments.
- Terraform for infrastructure provisioning with policy guardrails.
- Experience with the LGTM stack (Grafana, Loki, Prometheus/Mimir) and strong incident-response skills.
Benefits
- Equal opportunity employer.
- Reasonable accommodation during the hiring process.
- Opportunity to work on mission-critical, national-security programs.