This job is no longer available

The job listing you are looking for has expired.

Research Engineer - Distributed Training

Added: 3 days ago
Location:
Type: Full time
Salary: Not Specified


About CloudWalk:

CloudWalk is building the intelligent infrastructure for the future of financial services. Powered by AI, blockchain, and thoughtful design, our systems serve millions of entrepreneurs across Brazil and the US every day.

Our AI team trains large-scale language models that power real products - from payment intelligence and credit scoring to on-device assistants for merchants.

About the Role:

We’re looking for a Research Engineer to design, scale, and evolve CloudWalk’s distributed training stack for large language models. You’ll work at the intersection of research and infrastructure - running experiments across DeepSpeed, FSDP, Hugging Face Accelerate, and emerging frameworks like Unsloth, TorchTitan, and Axolotl.

You’ll own the full training lifecycle: from cluster orchestration and data streaming to throughput optimization and checkpointing at scale. If you enjoy pushing the limits of GPUs, distributed systems, and next-generation training frameworks, this role is for you.

Responsibilities:

  • Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters.
  • Optimize performance, memory, and cost across large training workloads.
  • Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
  • Build internal tools and templates that accelerate research-to-production transitions.
  • Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.

Requirements:

  • Strong background in PyTorch and distributed training (DeepSpeed, FSDP, Accelerate).
  • Hands-on experience with large-scale multi-GPU or multi-node training.
  • Familiarity with Transformers, Datasets, and mixed-precision techniques.
  • Understanding of GPUs, containers, and schedulers (Kubernetes, Slurm).
  • Mindset for reliability, performance, and clean engineering.

Bonus:

  • Experience with Ray, MLflow, or W&B.
  • Knowledge of ZeRO, model parallelism, or pipeline parallelism.
  • Curiosity for emerging open-source stacks like Unsloth, TorchTitan, and Axolotl.

Additional Information:

Our process is simple: a deep conversation on distributed systems and LLM training, and a cultural interview.

Competitive salary, equity, and the opportunity to shape the next generation of large-scale AI infrastructure at CloudWalk.
