Related skills
datadog bash python kubernetes go
Description
- Monitor the GTO Datadog dashboard; decide if alerts need an incident.
- Triage and investigate incidents; create Jira tickets; determine blast radius.
- Own low-severity incidents end-to-end; diagnose and resolve; escalate when needed.
- Support TSO Lead during major incidents; surface real-time data; update tickets.
- Draft incident communications: Slack updates, stakeholder notifications, and status pages.
- Analyze incident trends during non-incident periods; compile data and reports.
Requirements
- 4+ years in SRE/DevOps/production ops; payments/e-commerce/gaming preferred.
- Strong troubleshooting; trace issues across logs, APM, metrics, DB queries, and network.
- Datadog or equivalent; navigate APM, build log queries, read dashboards, interpret SLO burn rates.
- Scripting in Python, Go, or Bash; write automation scripts and tooling via APIs.
- Kubernetes and cloud infra (GCP preferred; AWS/Azure acceptable); pods, deployments, ingress.
- Experience with incident management tools: Jira Service Management, PagerDuty/OpsGenie, Slack, Confluence; strong English communication.
Benefits
- Experience in gaming/payments/fintech; uptime-critical environments.
- Familiarity with Datadog Service Catalog, synthetic monitoring, and RUM.
- Jira Service Management admin experience or ITIL certification, backed by practical experience.