VLA²:
Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

Han Zhao*,1,2, Jiaxuan Zhang*,2,3, Wenxuan Song4,
Pengxiang Ding1,2, Donglin Wang2
*Equal contribution
1Zhejiang University, 2MILAB, Westlake University, 3Southern University of Science and Technology, 4Hong Kong University of Science and Technology (Guangzhou)

We introduce Vision-Language-Action Agent (VLA²), a novel integrated system-level framework designed to enhance the capabilities of VLA systems by supporting the invocation of diverse tools, thereby extending the execution limits of current VLA models.

Evaluation results on our custom Hard-level benchmark involving unseen concepts (i.e., object textures and language descriptions outside the dataset)

The VLA² Framework

VLA² integrates the following modules to enhance the capabilities of the VLA model: Task Planning, Web/Memory Retrieval, Object Grounding, and Result Verification. Each module implements its function through one or more foundation models.
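For concreteness, below is a minimal Python sketch of how such an agentic loop could tie the four modules together. The module interfaces, data structures, and stubbed internals are illustrative assumptions, not the actual implementation; in the real system each module would call one or more foundation models.

# Illustrative sketch of an agentic loop over the four VLA^2 modules.
# All internals are placeholder stubs standing in for foundation models.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    instruction: str
    memory: dict = field(default_factory=dict)      # cached retrieval results
    subtasks: list = field(default_factory=list)
    grounded_masks: dict = field(default_factory=dict)


def plan_task(state: AgentState) -> None:
    """Task Planning: split the instruction into executable subtasks."""
    state.subtasks = [state.instruction]             # stub: single-step plan


def retrieve(state: AgentState, concept: str) -> str:
    """Web/Memory Retrieval: look up an unseen concept, cache the result."""
    if concept not in state.memory:
        state.memory[concept] = f"retrieved description of '{concept}'"  # stub
    return state.memory[concept]


def ground_objects(state: AgentState, subtask: str) -> None:
    """Object Grounding: locate target/placement objects and produce masks."""
    state.grounded_masks[subtask] = "colored-mask overlay"               # stub


def verify(state: AgentState, subtask: str) -> bool:
    """Result Verification: check whether the subtask succeeded."""
    return True                                                          # stub


def run(instruction: str) -> None:
    state = AgentState(instruction)
    plan_task(state)
    for subtask in state.subtasks:
        retrieve(state, subtask)
        ground_objects(state, subtask)
        # ... execute the mask-conditioned VLA policy here ...
        if not verify(state, subtask):
            plan_task(state)     # replan on failure


run("Put the blue and white porcelain bowl on the stove")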

OOD Information Processing

The figures below illustrate how VLA² handles a task containing an unseen concept in the observation ("Put the blue and white porcelain bowl on the stove").

Vision Processing

Language Processing

Once information is retrieved at the beginning of a task, it can be stored in memory for later reuse.
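A minimal sketch of this retrieve-once, reuse-later pattern as a simple memory cache is shown below; the class name, entry fields, and stand-in retriever are assumptions made for illustration only.

# Sketch of a retrieve-once / reuse-later memory cache for unseen concepts.
import time


class ConceptMemory:
    """Caches retrieval results so repeated lookups skip the web query."""

    def __init__(self):
        self._store = {}

    def get_or_retrieve(self, concept: str, retrieve_fn):
        if concept not in self._store:
            self._store[concept] = {
                "description": retrieve_fn(concept),   # e.g. web search result
                "retrieved_at": time.time(),
            }
        return self._store[concept]["description"]


memory = ConceptMemory()
describe = lambda c: f"reference images and text for '{c}'"   # stand-in retriever
print(memory.get_or_retrieve("blue and white porcelain bowl", describe))  # web hit
print(memory.get_or_retrieve("blue and white porcelain bowl", describe))  # memory hit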

Mask-Conditioned VLA

We overlay colored masks on the target objects and placement locations specified in the task description, and fine-tune OpenVLA to work with this modified input. The colored masks serve as a bridge between upstream unseen-concept recognition and downstream task execution by the VLA.
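A minimal sketch of how such a colored-mask overlay could be applied to an observation image before it is passed to the fine-tuned policy; the array shapes, mask regions, colors, and blending weight below are illustrative assumptions rather than the exact settings used in the paper.

# Sketch: blend a solid color into the observation wherever the grounded
# mask is set, then feed the result to the fine-tuned, mask-conditioned VLA.
import numpy as np


def apply_colored_mask(image: np.ndarray, mask: np.ndarray,
                       color=(255, 0, 0), alpha: float = 0.6) -> np.ndarray:
    """Blend `color` into `image` wherever `mask` is True (H x W boolean)."""
    out = image.astype(np.float32).copy()
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, np.float32)
    return out.astype(np.uint8)


# Toy example: a 224x224 RGB observation with rectangular mask regions.
obs = np.zeros((224, 224, 3), dtype=np.uint8)
target_mask = np.zeros((224, 224), dtype=bool)
target_mask[80:140, 90:160] = True                        # grounded target object
place_mask = np.zeros((224, 224), dtype=bool)
place_mask[10:60, 10:60] = True                           # grounded placement location

obs = apply_colored_mask(obs, target_mask, color=(255, 0, 0))   # target in red
obs = apply_colored_mask(obs, place_mask, color=(0, 255, 0))    # placement in green
# `obs` is then passed to the fine-tuned OpenVLA policy.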

Experiments

Evaluation on Original LIBERO

VLA² remains competitive with all methods that use OpenVLA as their backbone (Class 2 baselines) and achieves top-tier performance on all LIBERO suites except LIBERO-Object.


Evaluation on Customized Environment

Based on the LIBERO simulation environment, we designed object-generalization tasks at three difficulty levels: simple color variations (Easy), manipulation of generalized target objects (Medium), and generalization to objects with unseen concepts (Hard).

As the benchmark difficulty increases, all baselines exhibit a significant decline in task success rates, while VLA² demonstrates a clear advantage on tasks that require strong generalization capabilities.


Detailed Task-by-Task Success Rates on the Hard Benchmark

BibTeX

@article{zhao2025vla2,
    title={VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation},
    author={Han Zhao and Jiaxuan Zhang and Wenxuan Song and Pengxiang Ding and Donglin Wang},
    journal={arXiv preprint arXiv:2510.14902},
    year={2025},
}