Added
26 days ago
Type
Full time
Salary
Upgrade to Premium to se...

Related skills

azure scripting aws python gcp

๐Ÿ“‹ Description

  • Monitor and improve reliability of production AI systems and workflow infra
  • Identify, diagnose, and fix issues across apps, integrations, and cloud
  • Own incident response incl. root cause analysis and long-term remediation
  • Implement monitoring, alerting, and observability to ensure uptime
  • Collaborate with Engineering to harden deployments and scale systems
  • Support customer-facing teams by troubleshooting live environment issues
  • Document configurations, operational procedures, and recovery protocols
  • Continuously improve reliability standards, deployment practices, and safeguards

๐ŸŽฏ Requirements

  • 3+ years of experience in support engineering, site reliability engineering, or infrastructure maintenance
  • Strong proficiency in Python or scripting languages
  • Experience managing cloud infrastructure (AWS, GCP, or Azure)
  • Strong problem-solving skills and a proactive, preventative mindset
  • Clear communication skills and ability to collaborate across engineering and customer-facing teams
  • High ownership and accountability in high-reliability environments
Share job

Meet JobCopilot: Your Personal AI Job Hunter

Automatically Apply to Engineering Jobs. Just set your preferences and Job Copilot will do the rest โ€” finding, filtering, and applying while you focus on what matters.

Related Engineering Jobs

See more Engineering jobs โ†’