Site Reliability Engineer II

Added: 14 days ago
Type: Full time
Salary: Not provided

Related skills

Datadog, Python, Go, observability, incident management

📋 Description

  • Design resilient systems with capacity planning in mind.
  • Define and measure SLOs/SLIs reflecting customer experience.
  • Use Datadog and CloudWatch for high-signal observability.
  • Configure alerting and routing via incident.io so on-call pages stay relevant.
  • Improve the incident lifecycle: detection, triage, communication, follow-ups.
  • Build highly available, easy-to-debug systems; minimize 2 a.m. alerts.

🎯 Requirements

  • Bachelor's or master's in CS or equivalent industry experience
  • 3+ years in SRE or software engineering
  • Python and/or Go coding
  • Experience taking distributed systems from design to production
  • Reliability mindset: SLOs/SLIs, error budgets, MTTR
  • Observability and incident response; diagnose from logs/metrics
  • Cross-functional communication; documentation and runbooks
  • Operational tooling and AI fluency; IaC workflows
  • Leadership and mentorship to drive reliability initiatives

🎁 Benefits

  • Healthcare
  • Internet and cell phone reimbursement
  • Learning and development stipend
  • Potential travel to Mountain View HQ