Serve as the on-site reliability engineer at a government site; define SLIs and SLOs.
Lead on-site incident response in a restricted, air-gapped AWS environment.
Own observability for the on-site deployment (LGTM stack: Grafana, Loki, Tempo, Mimir).
Manage deployment and infra with Docker, Docker Compose, and Terraform in the enclave.
Automate toil; run post-incident reviews and deliver durable fixes.
Liaise with government customers; translate ops needs to engineering.

🎯 Requirements

5+ years in SRE/prod ops or related infra role.
Define/track SLIs/SLOs and error budgets; incident response experience.
Docker, Docker Compose; AWS (EC2, ECS, RDS, VPCs).
Linux/Unix administration; productive in constrained environments without a GUI.
Terraform for infra provisioning within guardrails.
LGTM stack (Grafana, Loki, Tempo, Mimir) experience.
