We collect QUARD, a large-scale multi-task QUAdruped Robot Dataset. It covers tasks such as perception and navigation, as well as advanced capabilities like object avoidance. To the best of our knowledge, this is the first quadruped robot dataset that incorporates a substantial amount of vision, language-instruction, and robot-command data. Since collecting data on real robots is expensive and inefficient, we primarily rely on data generated in simulation, which differs significantly from the real world in visual appearance, sensor characteristics, and system dynamics.
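To make the structure of such vision-language-action data concrete, here is a minimal sketch of loading one QUARD-style sample. The episode file layout and the field names (`rgb`, `instruction`, `command` inside an `.npz` archive) are illustrative assumptions, not the released format.

```python
# Sketch of a QUARD-style sample: an image observation, a language instruction,
# and the robot command recorded at that timestep. Field names are assumptions.
from dataclasses import dataclass
import numpy as np


@dataclass
class QuardSample:
    rgb: np.ndarray      # (H, W, 3) onboard camera observation
    instruction: str     # natural-language task instruction, e.g. "go to the red cube"
    command: np.ndarray  # low-dimensional robot command (velocities, gait parameters, ...)


def load_episode(path: str) -> list[QuardSample]:
    """Load one simulated or real episode into a list of per-timestep samples."""
    data = np.load(path, allow_pickle=True)
    return [
        QuardSample(rgb=frame, instruction=str(data["instruction"]), command=cmd)
        for frame, cmd in zip(data["rgb"], data["command"])
    ]
```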
The architecture of QUART is designed to leverage the scene-comprehension capability of a pretrained MLLM. It takes visual observations and text-form instructions as input, predicts action tokens representing the command the robot should execute, and de-tokenizes them into concrete action values. In real-world deployment, QUART generates a complete action sequence at a processing rate of 2 Hz and hands it over to the underlying low-level policy for execution.
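The de-tokenization step can be illustrated with a short sketch: discrete action tokens are mapped back to continuous robot commands via uniform bins. The bin count and command ranges below are placeholder assumptions, not QUART's released configuration.

```python
# Sketch of action de-tokenization: integer tokens -> continuous command values
# via uniform binning. NUM_BINS and ACTION_RANGES are illustrative assumptions.
import numpy as np

NUM_BINS = 256
# Per-dimension (low, high) ranges of the command vector, e.g. forward velocity,
# yaw rate, body height, gait frequency -- values here are placeholders.
ACTION_RANGES = np.array([[-1.0, 1.0], [-1.0, 1.0], [0.1, 0.4], [1.0, 4.0]])


def detokenize(action_tokens: np.ndarray) -> np.ndarray:
    """Map integer tokens in [0, NUM_BINS) to continuous command values."""
    low, high = ACTION_RANGES[:, 0], ACTION_RANGES[:, 1]
    return low + (action_tokens / (NUM_BINS - 1)) * (high - low)


# Example: one predicted token per command dimension at a 2 Hz control step;
# the resulting command is handed to the low-level locomotion controller.
tokens = np.array([128, 64, 200, 32])
command = detokenize(tokens)
```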
We show both simulation and real-world scenarios for all six tasks: Go to, Distinguish, Go through, Crawl, Go avoid, and Unload.
Here we show results under different sim-to-real training paradigms, together with a failure-case analysis. The four paradigms are listed below (see the data-mixing sketch after the list):
1. Simulation + Real Data
2. 10% Simulation + Real Data
3. Simulation Data
4. Real Data
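The sketch below shows how the four training mixtures above could be assembled from a simulation set and a real-robot set. The dataset handles and the subsampling helper are assumptions used only to illustrate the paradigms.

```python
# Sketch of building the four sim-to-real training mixtures listed above.
import random


def subsample(episodes, fraction, seed=0):
    """Keep a random fraction of the episodes (used for the 10% simulation split)."""
    rng = random.Random(seed)
    k = max(1, int(len(episodes) * fraction))
    return rng.sample(episodes, k)


def build_mixture(sim_episodes, real_episodes, paradigm):
    if paradigm == "sim+real":      # 1. Simulation + Real Data
        return sim_episodes + real_episodes
    if paradigm == "10%sim+real":   # 2. 10% Simulation + Real Data
        return subsample(sim_episodes, 0.1) + real_episodes
    if paradigm == "sim":           # 3. Simulation Data
        return list(sim_episodes)
    if paradigm == "real":          # 4. Real Data
        return list(real_episodes)
    raise ValueError(f"unknown paradigm: {paradigm}")
```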
We also compare results across different initial localizations, which shows that our model is robust to the choice of initial position.
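A minimal sketch of this robustness check is given below: the policy is rolled out from several initial poses and a per-pose success rate is reported. The environment API (`reset(initial_pose=...)`, `step` returning a success flag) is an assumption for illustration, not the evaluation code used in the paper.

```python
# Sketch of evaluating robustness to the initial pose: roll out the policy
# several times per starting pose and report success rates. Environment API
# and the rollout helper are illustrative assumptions.
def rollout(env, policy, obs, max_steps=200):
    """Run one episode; return True if the environment reports task success."""
    for _ in range(max_steps):
        action = policy(obs)
        obs, done, success = env.step(action)
        if done:
            return success
    return False


def evaluate_initial_poses(env, policy, initial_poses, trials=10):
    """Return the success rate of `policy` for each starting pose."""
    rates = {}
    for pose in initial_poses:
        successes = 0
        for _ in range(trials):
            obs = env.reset(initial_pose=pose)
            successes += int(rollout(env, policy, obs))
        rates[tuple(pose)] = successes / trials
    return rates
```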
@inproceedings{ding2025quar,
  title={{QUAR-VLA}: Vision-Language-Action Model for Quadruped Robots},
  author={Ding, Pengxiang and Zhao, Han and Zhang, Wenjie and Song, Wenxuan and Zhang, Min and Huang, Siteng and Yang, Ningxi and Wang, Donglin},
  booktitle={European Conference on Computer Vision},
  pages={352--367},
  year={2025},
  organization={Springer}
}