Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)

Type
Full time

Related skills

gitops azure terraform aws prometheus

📋 Description

  • Collaborate with software teams to ensure the reliability and security of Federal infrastructure.
  • Design, deploy, and scale AI/ML/LLM infrastructure across AWS, Azure, and GCP.
  • Manage Kubernetes (EKS/AKS/GKE) for AI services and data pipelines.
  • Automate data/model pipelines for AI workloads using Terraform, Python, CI/CD.
  • Use GitOps, CI/CD, Docker, Kubernetes to streamline ML/LLM tasks.
  • Implement monitoring with Prometheus, Grafana, ELK/EFK, Langfuse.

🎯 Requirements

  • Bachelor's or Master's in CS/Engineering or equivalent.
  • 8+ years in SRE, DevOps, or Cloud Infrastructure.
  • 4+ years Kubernetes (EKS/GKE/AKS) and Docker.
  • Strong Python, plus Bash, Go, PowerShell, or Zyphyrscript.
  • IaC tools: Terraform, CloudFormation.
  • Kubernetes Operators, Helm, GitOps (ArgoCD/Flux).
  • Serverless (AWS Lambda, Azure Functions).
  • AI/ML pipelines experience (RAG, fine-tuning, inference).

๐ŸŽ Benefits

  • Globally distributed team with work-from-home and office options.
  • Inclusive culture and opportunities for growth.
  • Work on cutting-edge AI/ML security technology.