Related skills
docker python kubernetes airflow pyspark
Description
- Design and scale distributed data pipelines for preprocessing and dataset refreshes
- Own workflow orchestration, scheduling, monitoring, and recovery for large data jobs
- Implement and maintain containerized pipeline infrastructure using Kubernetes
- Optimize cloud storage and movement across AWS, GCS, and Azure
- Define dataset storage layout, versioning, caching, and access patterns
- Design curation pipelines for video and image data, including image-text pairs
Requirements
- Hands-on experience building ML data pipelines, including dataset curation and quality improvement
- Experience with distributed data processing using PySpark or Ray, and orchestration tools like Airflow
- Familiarity with containerization and orchestration, including Docker and Kubernetes
- Experience with cloud storage/compute (AWS, GCS, Azure) and cost-throughput tradeoffs
- Experience with VLM-based captioning pipelines or quality/aesthetic scoring models for video or image data
- Familiarity with CLIP-based or embedding-based filtering and semantic data selection techniques
Benefits
- Competitive salary and generous company equity
- Personal time off and paid holidays
- Health insurance
- Global travel insurance: Covers you when traveling internationally
- Monthly spending stipend: $500 (~S$635)
- Equipment: All equipment needed for your home office