Site Reliability Engineer

Type
Full time
Salary
Salary not provided

Related skills

Datadog, Docker, Terraform, Grafana, Prometheus

📋 Description

  • Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure for web services and ML workloads.
  • Ensure platform, inference and model training environments are highly available and replicable across HPC clusters.
  • Operate production systems and troubleshoot issues (on-call, data extraction, admin tasks, scaling).
  • Implement and improve monitoring, alerting, and incident response to minimize downtime.
  • Build and maintain CI/CD, containerization, orchestration, logging and alerting for client APIs and large training runs.
  • Participate in on-call rotations, perform root-cause analysis, and prevent future incidents.

🎯 Requirements

  • Master’s degree in Computer Science, Engineering or related field.
  • 7+ years of DevOps/SRE experience.
  • Strong experience with cloud computing and highly available distributed systems.
  • Hands-on experience with CI/CD, containerization, and orchestration using Docker and Kubernetes.
  • Experience with monitoring and observability tools: Prometheus, Grafana, Datadog, ELK Stack.
  • Experience with infrastructure-as-code tools such as Terraform or CloudFormation.

🎁 Benefits

  • Competitive salary and equity
  • Healthcare: Medical/Dental/Vision for you and family
  • 401(k): 6% matching
  • PTO: 18 days
  • Visa sponsorship
  • BetterUp coaching on a voluntary basis
