Overview
Alchemy is seeking a Senior Site Reliability Engineer to join our platform reliability team. You will design, build, and operate the scalable, highly available services that power Alchemy's products. You will also own incident response, monitoring, and reliability improvements across our distributed, cloud-based services.
Responsibilities
- Design, implement, and maintain scalable infrastructure and CI/CD pipelines.
- Own on-call rotations and lead incident response and post-incident reviews.
- Build and maintain the observability stack using Prometheus, Grafana, and related tooling.
- Drive reliability improvements using infrastructure as code and declarative tooling (Terraform, Kubernetes).
- Collaborate with software engineers to define SLOs/SLIs and perform capacity planning.
- Document runbooks and maintain robust incident playbooks.
Requirements
- 5+ years of Site Reliability Engineering, DevOps, or equivalent experience.
- Strong Linux system administration and networking fundamentals.
- Hands-on experience with a major cloud provider (AWS, GCP, or Azure).
- Proficiency with Kubernetes and container orchestration.
- Experience with Prometheus, Grafana, and monitoring/observability tooling.
- Infrastructure as Code experience (Terraform, CloudFormation).
- Proficiency in at least one programming language (Go, Python, or similar).
- Excellent communication and collaboration skills.
Nice-to-have
- Experience with chaos engineering, SRE practices, and incident management tooling.
- Experience with security-conscious design and compliance requirements.
Benefits
Competitive compensation and stock options, comprehensive health insurance, retirement plan, flexible work arrangements, generous PTO, and opportunities for professional growth at Alchemy.