About Twilio
Twilio is a leading cloud communications platform powering programmable communications globally. We are seeking a Manager, NOC (Network Operations Center) to drive reliability and incident response for our critical services.
Role summary
As the Manager, NOC, you will lead a team of NOC engineers responsible for monitoring, incident management, and uptime across Twilio's platforms. You will own on-call scheduling, runbooks, and post-incident reviews, and collaborate with Platform, Security, and Customer Support to meet SLA targets.
Responsibilities
- Lead and mentor a team of NOC engineers, including hiring, coaching, and performance management.
- Develop and maintain robust monitoring and alerting using tools such as Prometheus, Grafana, and Datadog.
- Oversee 24/7 incident management and response, including on-call rotations.
- Coordinate with cross-functional teams on incident resolution and root-cause analysis.
- Refine runbooks, automation, and post-incident reviews to reduce toil and improve reliability.
- Support capacity planning and disaster recovery planning to ensure service resilience across cloud environments (AWS, GCP, Azure).
Qualifications
- 5+ years of NOC or network operations experience, including at least 2 years in a supervisory or management role.
- Strong knowledge of modern monitoring and incident management tools (Prometheus, Grafana, Datadog, PagerDuty).
- Experience with cloud platforms (AWS, GCP, Azure) and scripting (Python, Bash).
- Excellent communication and collaboration skills, with the ability to operate in a remote-first environment and across time zones.
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
What we offer
- Competitive compensation and benefits
- Remote-first team with flexible hours
- Learning and growth opportunities