Added: 7 days ago
Type: Full time
Related skills: Grafana, Prometheus, observability, reliability, incident management

Description
- Own end-to-end fleet reliability from Day 0 provisioning to Day 2 steady-state.
- Lead cross-functional programs to improve fleet delivery, readiness, and stability.
- Define program plans, milestones, dependencies, risks, and success criteria.
- Proactively manage cross-team dependencies and unblock execution.
- Establish and own fleet reliability metrics and dashboards.
- Use data and post-incident learnings to prioritize reliability investments.

Requirements
- Bachelor's degree in Computer Engineering or related field.
- 10+ years in technical program management for large-scale compute infrastructure.
- Experience with observability/monitoring/telemetry (Prometheus, Grafana, OpenTelemetry).
- Experience leading cross-functional programs for reliability and availability.
- Strong technical knowledge across compute, storage, networking, or SRE.
- Data-driven, using metrics to guide prioritization and decisions.
- Excellent communication and stakeholder management skills, including executive reporting.

Benefits
- Base salary: $188,000-$275,000 USD.
- Medical, dental, and vision insurance (100% paid).
- 401(k) with employer match.
- Flexible PTO and paid parental leave.
- Tuition reimbursement and ESPP eligibility.
- Mental wellness benefits and family-forming support.
Help us maintain the quality of jobs posted on Empllo!