Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)

Added: 1 hour ago
Type: Full time

Related skills

gitops docker terraform cloudformation bash

📋 Description

  • Collaborate with software teams to ensure the reliability, performance, and security of federal infrastructure.
  • Design, deploy, and scale reliable AI/ML/LLM infrastructure on AWS, Azure, and GCP.
  • Manage Kubernetes (EKS/AKS/GKE) for AI services and data pipelines.
  • Build data and model pipelines for fine-tuning, inference, and RAG using Terraform and Python.
  • Leverage GitOps, CI/CD, and Docker/Kubernetes to streamline ML/LLM tasks.
  • Implement observability with Prometheus, Grafana, ELK/EFK, and Langfuse, and define SLIs, SLOs, and SLAs.

🎯 Requirements

  • Bachelor’s or Master’s degree in CS, Engineering, or a related field, or equivalent experience.
  • 8+ years in SRE, DevOps, Platform Engineering, MLOps, or Cloud Infrastructure.
  • 4+ years of production experience with Kubernetes (EKS/GKE/AKS) and Docker.
  • Strong Python; proficient in Bash/Go.
  • Infrastructure as Code (IaC) with Terraform and CloudFormation.
  • Kubernetes Operators, Helm, GitOps (ArgoCD/Flux).