Results
We start the evaluation of RT-2 by testing the emergent properties of the model. Since we can’t fully anticipate the extent of RT-2’s generalization, we present a number of previously unseen objects to the robot and evaluate its performance on tasks that require semantic understanding far beyond the robot data that the model was fine-tuned on. You can see qualitative examples of successful tasks that we found surprising below:
To quantify the emergent properties of RT-2, we categorize them into three groups: symbol understanding, reasoning, and human recognition. We evaluate two variants of RT-2:
- RT-2 trained on top of PaLM-E (12B parameters),
- RT-2 trained on top of PaLI-X (55B parameters)
against its predecessor, RT-1, and another visual pre-training method, VC-1. The results below demonstrate a significant improvement of RT-2 over these baselines (roughly 3x).
We also evaluate the two variants of RT-2, together with additional baselines, in a blind A/B study and present the results across multiple generalization axes below. RT-2 improves generalization by approximately 2x over the baselines.
To better understand how different design choices of RT-2 impact the generalization results, we ablate the two most significant design decisions:
- the model size: 5B vs 55B for the RT-2 PaLI-X variant,
- training recipe: training the model from scratch vs fine-tuning vs co-fine-tuning (a minimal sketch of this distinction follows below this list).
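To make the distinction between the three recipes concrete, here is a minimal sketch of one way a co-fine-tuning data mixture could be assembled. The dataset names, the 50/50 mixture weight, and the `sample_batch` helper are illustrative assumptions rather than the exact setup used for RT-2.

```python
import random

# Hypothetical dataset handles; names and mixture weight are illustrative assumptions,
# not the exact RT-2 training mixture.
VLM_DATA = "web_vision_language_pairs"   # original vision-language pre-training data
ROBOT_DATA = "robot_trajectories"        # robot demonstrations with tokenized actions

def sample_batch(dataset_name, n):
    """Placeholder data loader; returns n (image, text) examples from the named dataset."""
    return [(f"{dataset_name}/image_{i}", f"{dataset_name}/text_{i}") for i in range(n)]

def co_fine_tuning_batch(batch_size, robot_fraction=0.5):
    """Build one co-fine-tuning batch.

    The three recipes differ roughly as follows:
      - from scratch:   train on robot data only, starting from random weights
      - fine-tuning:    train on robot data only, starting from the pre-trained VLM weights
      - co-fine-tuning: start from the VLM weights and keep sampling both data sources,
                        as sketched here.
    """
    n_robot = int(batch_size * robot_fraction)
    batch = sample_batch(ROBOT_DATA, n_robot) + sample_batch(VLM_DATA, batch_size - n_robot)
    random.shuffle(batch)
    return batch
```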
The results below indicate the importance of the pre-trained weights of the vision-language model, as well as the trend of generalization improving with model size.
We also evaluate RT-2 on the open-source Language-Table benchmark, training it on both simulated and real Language-Table data. In addition to achieving the state-of-the-art result on the simulation benchmark (90% vs 77% for the previous SoTA), we evaluate the resulting model in the real world and demonstrate RT-2’s generalization to objects never seen in the Language-Table datasets before, such as a ketchup bottle and a banana:
Lastly, since the RT-2 PaLM-E variant is a vision-language-action model that can act as an LLM, a VLM, and a robotic controller all within a single neural network, we demonstrate that RT-2 can perform chain-of-thought reasoning for control. In the examples below, RT-2 first outputs a few reasoning steps in natural language, followed by the string `Action:` and the resulting action tokens.
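As a concrete illustration, the snippet below sketches how such an output string could be split into the natural-language plan and the action tokens. The example output, the `Plan:` prefix, and the eight-number action format are illustrative assumptions, not RT-2’s exact decoding code.

```python
def parse_cot_output(model_output: str):
    """Split a model output of the form '<reasoning> Action: <action tokens>'."""
    reasoning, _, action_str = model_output.partition("Action:")
    action_tokens = [int(tok) for tok in action_str.split()]
    return reasoning.strip(), action_tokens

# Hypothetical output string in the style described above.
example = (
    "Plan: pick up the bag of chips because the user asked for a snack. "
    "Action: 1 128 91 241 5 101 127 217"
)
plan, action = parse_cot_output(example)
print(plan)    # natural-language reasoning steps
print(action)  # discretized action tokens passed to the robot controller
```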
This shows the promise of fully integrated VLA models that can transfer not only some of the semantic concepts across different modalities (e.g. generalizing robot actions to new semantic categories) but also some of the capabilities of the underlying models (e.g. chain-of-thought reasoning).