Own infrastructure lifecycle: provisioning, upgrades, scaling, decommissioning (IaC-first)
Operate and scale ClickHouse clusters: sharding, replication, tuning
Operate Kafka ingestion backbone; improve throughput, lag, backpressure, recovery
Improve latency and reliability for data-heavy serving and query workloads
Build and maintain monitoring/alerting: SLIs/SLOs, dashboards, runbooks
Define, implement, and improve incident response standards and on-call practices

🎯 Requirements

Track record owning production infra for data-heavy, low-latency systems end to end
Strong hands-on experience operating ClickHouse, Kafka, and related data systems
Practical experience with Snowflake workflows and cross-system data architecture
Ability to define operational standards (runbooks, incident process) and enforce them
Strong operational experience with Kubernetes, Terraform, and cloud infra
Excellent communication and collaboration across engineering and research teams
