Site Reliability Engineer I

Added: 13 days ago
Type: Full time
Salary: Not provided

Related skills

Datadog, Docker, AWS, Python, Kubernetes

📋 Description

  • Design systems with resilience and capacity in mind.
  • Define and measure SLOs/SLIs reflecting customer impact.
  • Use Datadog with CloudWatch for signal-heavy observability.
  • Configure alerting and routing via incident.io for on-call.
  • Improve incident lifecycle from detection to postmortems.
  • Build reliable, debuggable systems; minimize 2 a.m. alerts.

🎯 Requirements

  • Software background; curiosity about large-scale production systems.
  • Proficient in Python or Go; automation experience a plus.
  • Some exposure to AWS, Docker, Kubernetes.
  • Familiar with metrics, logs, and traces; monitoring concepts.
  • Awareness of SLOs/SLIs and of why reliability matters to end users.
  • BS/BE in CS/Engineering plus 1+ year of SRE, DevOps, or engineering experience (internships count); familiarity with AI tooling.

🎁 Benefits

  • Healthcare
  • Internet / cell phone reimbursement
  • Learning and development stipend
  • Opportunities to travel to the Palo Alto HQ and the Bangkok site