Related skills
python, pandas, Apache Spark, data pipelines, data quality
Description
- Maintain large-scale pipelines for processing web corpora.
- Implement filtering and quality-scoring systems for high-value web documents.
- Analyze web data composition across domains, languages and time.
- Develop and maintain high-performance deduplication pipelines.
- Collaborate with cross-functional teams to ensure data pipelines meet the demands of cutting-edge language models.
Requirements
- Strong software engineering skills with Python and experience building data pipelines.
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
- Experience working with large-scale web datasets.
- Knowledge of data quality assessment techniques and experimentation with data mixtures.
- A passion for bridging research and engineering to solve complex data-related challenges in AI model training.
- Bonus: papers at top-tier venues (NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).
Benefits
- Open and inclusive culture and work environment
- Work with a team on AI research
- Weekly lunch stipend, in-office lunches and snacks
- Full health and dental benefits, mental health budget
- 100% parental leave top-up for up to 6 months
- Remote-flexible with offices in major cities and coworking stipend