Member of Technical Staff, Vision-Language-Action (VLA)

@ XDOF
San Francisco, California, United States
Hybrid · Full Time
Responsibilities: design pipelines, develop representations, build datasets
Requirements Summary: MS or PhD in CS/Robotics/ML; 3–7 years in vision-language models, video understanding, robot learning, or related areas; deep PyTorch; published work; strong engineering skills.
Technical Tools Mentioned: PyTorch, Distributed Training, Mixed Precision, Large Batch Workflows
Job Description

About XDOF

Frontier labs are racing to build general-purpose robots, and the bottleneck isn't compute. It's data. At XDOF, we're building the foundation behind the foundation models: the data collection systems, annotation pipelines, exabyte-scale data infrastructure, and software toolchain that enable our partners to push the field forward.

We're hiring a Research Engineer / Scientist to lead technical efforts at the intersection of vision-language models and robot learning. You will build systems that turn raw egocentric and teleoperation video into high-signal training data for VLA models and, increasingly, contribute to the models themselves.

Beyond pipelines, you will drive research into what makes robot data useful: discovering new metadata (contact events, affordance labels, implicit reward signals, dynamics priors from video) that unlocks capabilities current approaches miss. You'll explore how structured annotations can improve cross-embodiment transfer, automatic curriculum generation, and world models that predict what actually matters for manipulation. The data layer isn't downstream of the research. It is the research.

What You'll Do

  • Design and implement vision-language pipelines for egocentric and teleoperation video: structured captioning, temporal grounding, action-conditioned scene understanding, and semantic annotation at scale

  • Develop and evaluate representations that bridge visual perception, language, and low-level robot action — spanning VLAs, video prediction, and world models

  • Build and improve data curation systems that assess quality, diversity, and coverage of large-scale robot demonstration datasets

  • Work hands-on with bimanual and high-DoF manipulation data, including real teleoperation footage and sim-generated rollouts

  • Collaborate directly with partner labs to define data requirements and close the loop between data quality and downstream policy performance

  • Stay current on the research frontier (VLAs, video foundation models, flow matching, DiT architectures, egocentric pretraining) and translate insights into production systems

About You

Required:

  • MS or PhD in Computer Science, Robotics, Machine Learning, or a related field from a top-tier program

  • 3–7 years of research or applied research experience (industry or academic) in one or more of: vision-language models, video understanding, robot learning, or generative modeling

  • Deep fluency in PyTorch; working knowledge of large-scale training infrastructure (distributed training, mixed precision, large batch workflows)

  • Published work or demonstrable impact in VLMs/VLAs, video representation learning, imitation learning, or a closely related area

  • Strong engineering fundamentals — you can design clean systems, not just run experiments

Benefits

  • Competitive compensation and equity

  • Comprehensive health and wellness benefits

  • Flexible work arrangements

  • Collaborative and fast-paced work environment

  • Opportunity to shape the future of robotics and AI alongside an ambitious, values-driven team

Level: Mid-Level to Senior Research Scientist (L4–L5 equivalent)
Location: [San Mateo / Remote Policy]

Note: Junior candidates will still be considered

If you’re excited to help build the infrastructure powering tomorrow’s intelligent machines, we’d love to hear from you!