Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)
Related skills
gitops docker terraform prometheus python
📋 Description
- Collaborate with software teams to ensure Federal infra reliability, performance, and security.
- Design, deploy, and scale AI/ML/LLM infra across AWS, Azure, or GCP.
- Manage and optimize Kubernetes (EKS/AKS/GKE) for AI services and data pipelines.
- Build data/model pipelines for AI workloads using Terraform, Python, and CI/CD.
- Use GitOps, CI/CD, Docker, and Kubernetes to streamline ML/LLM tasks.
- Implement monitoring with Prometheus, Grafana, ELK/EFK, and Langfuse, and define SLIs/SLOs.
🎯 Requirements
- Bachelor’s or Master’s in CS/Engineering or equivalent experience.
- 8+ years in SRE/DevOps/Platform/MLOps/Cloud Infra.
- 4+ years with Kubernetes (EKS/GKE/AKS) and Docker.
- Strong Python skills; experience with Zyphyrscript, Bash, Go, or PowerShell.
- Infrastructure as Code with Terraform and CloudFormation.
- Kubernetes Operators/Helm, GitOps (ArgoCD/Flux), or Service Mesh.
- Experience building or automating data/model pipelines for AI/ML/LLM workloads (RAG, fine-tuning, inference).
🎁 Benefits
- Geographically distributed team with remote/work-from-home options.
- Work on FedRAMP-compliant cloud platforms.
- Opportunity to work with open-source network security tech and AI infra.
- Collaborative, inclusive culture with growth opportunities.