Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)

Added: 7 minutes ago
Type: Full time

Related skills

gitops docker terraform prometheus python

📋 Description

  • Collaborate with software teams to ensure the reliability, performance, and security of Federal infrastructure.
  • Design, deploy, and scale AI/ML/LLM infra across AWS, Azure, or GCP.
  • Manage and optimize Kubernetes (EKS/AKS/GKE) for AI services and data pipelines.
  • Build data/model pipelines for AI workloads using Terraform, Python, and CI/CD.
  • Use GitOps, CI/CD, Docker, and Kubernetes to streamline ML/LLM tasks.
  • Implement monitoring with Prometheus, Grafana, ELK/EFK, and Langfuse, and define SLIs/SLOs.

🎯 Requirements

  • Bachelor’s or Master’s in CS/Engineering or equivalent experience.
  • 8+ years in SRE/DevOps/Platform/MLOps/Cloud Infra.
  • 4+ years with Kubernetes (EKS/GKE/AKS) and Docker.
  • Strong Python skills; experience with Zyphyrscript, Bash, Go, or PowerShell.
  • Infrastructure as Code with Terraform and CloudFormation.
  • Experience with Kubernetes Operators/Helm, GitOps (ArgoCD/Flux), or a service mesh.
  • Experience building or automating data/model pipelines for AI/ML/LLM workloads (RAG, fine-tuning, inference).

🎁 Benefits

  • Geographically distributed team with remote/work-from-home options.
  • Work on FedRAMP-compliant cloud platforms.
  • Opportunity to work with open-source network security tech and AI infra.
  • Collaborative, inclusive culture with growth opportunities.