Related skills
datadog, docker, terraform, grafana, prometheus
📋 Description
- Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure for web services and ML workloads.
- Ensure platform, inference and model training environments are highly available and replicable across HPC clusters.
- Operate production systems and troubleshoot issues (on-call, data extraction, admin tasks, scaling).
- Implement and improve monitoring, alerting, and incident response to minimize downtime.
- Build and maintain CI/CD, containerization, orchestration, logging and alerting for client APIs and large training runs.
- Participate in on-call rotations to perform root‑cause analysis and prevent future incidents.
🎯 Requirements
- Master’s degree in Computer Science, Engineering or related field.
- 7+ years of DevOps/SRE experience.
- Strong experience with cloud computing and highly available distributed systems.
- Hands-on experience with CI/CD, containerization, and orchestration using Docker and Kubernetes.
- Experience with monitoring and observability tools: Prometheus, Grafana, Datadog, ELK Stack.
- Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
🎁 Benefits
- Competitive salary and equity
- Healthcare: Medical/Dental/Vision for you and family
- 401K: 6% matching
- PTO: 18 days
- Visa sponsorship
- BetterUp coaching on a voluntary basis