Added: 8 days ago
Type: Full time
Salary: Not provided

Related skills

datadog bash python kubernetes go

📋 Description

  • Monitor the GTO Datadog dashboard; decide whether alerts warrant opening an incident.
  • Triage and investigate incidents; create Jira tickets; determine blast radius.
  • Own low-severity incidents end-to-end; diagnose and resolve; escalate when needed.
  • Support TSO Lead during major incidents; surface real-time data; update tickets.
  • Draft incident communications: Slack updates, stakeholder notifications, and status pages.
  • Analyze incident trends during non-incident periods; compile data and reports.

🎯 Requirements

  • 4+ years in SRE/DevOps/production ops; payments/e-commerce/gaming preferred.
  • Strong troubleshooting; trace issues across logs, APM, metrics, DB queries, and network.
  • Datadog or equivalent; navigate APM, build log queries, read dashboards, interpret SLO burn rates.
  • Scripting in Python, Go, or Bash; write automation scripts and tooling via APIs.
  • Kubernetes and cloud infra (GCP preferred; AWS/Azure acceptable); pods, deployments, ingress.
  • Experience with incident management tools: Jira Service Management, PagerDuty/Opsgenie, Slack, Confluence; strong English communication.

🎁 Benefits

  • Experience in gaming/payments/fintech; uptime-critical environments.
  • Familiarity with Datadog Service Catalog, synthetic monitoring, and RUM.
  • Jira Service Management admin experience or ITIL certification; practical experience.