NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

Ziyue Zhu*,1 , Shangyang Wu*,1 , Shuai Zhao†,1 , Zhiqiu Zhao1 , Shengjie Li1 , Yi Wang2 , Fang Li2 , Haoran Luo†,2
1 Beijing University of Posts and Telecommunications    2 Nanyang Technological University
* Equal contribution   † Corresponding authors
Figure 1. Overview of the NS-VLA pipeline: a neuro-symbolic encoder, a symbolic solver, and online RL optimization execute instruction-conditioned manipulation by orchestrating symbolic primitives and sparse action chunks.

Abstract

Vision-Language-Action (VLA) models ground natural-language instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face three challenges: learning reusable primitives that transfer across tasks, reducing reliance on large-scale data and complex architectures, and exploring beyond expert demonstrations.

To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework trained via online reinforcement learning (RL). It introduces a symbolic encoder that embeds vision and language features and extracts structured primitives, employs a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize action generation through expanded exploration.

Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency, and an expanded exploration space.

Highlights

🧩

Neuro-Symbolic Encoding

Extracts structured primitives from vision-language inputs, capturing reusable action structures across tasks.

⚡

Symbolic Solver

Lightweight solver with visual token sparsification for data-efficient, real-time action generation.

🔄

Online RL Optimization

GRPO-based training with primitive-segmented rewards enables exploration beyond expert demonstrations.

📊

State-of-the-Art Results

98.6% success rate (SR) on LIBERO, 79.4% on LIBERO-Plus, and 91.2% 5-task SR on CALVIN ABC→D.
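To make the training signal concrete, the following is a minimal sketch (not the authors' code) of how GRPO-style group-relative advantages can be combined with primitive-segmented rewards: each rollout is scored per symbolic primitive, and advantages are computed relative to the group of sampled rollouts. All names (`primitive_segmented_reward`, the reach/grasp/place primitives, the group size) are illustrative assumptions.

```python
# Hypothetical sketch of GRPO advantages with primitive-segmented rewards.
# Not the NS-VLA implementation; reward shaping and names are assumptions.
import statistics

def primitive_segmented_reward(segments):
    """Score a rollout per primitive segment: each (primitive, success)
    pair contributes 1.0 on success, 0.0 otherwise."""
    return sum(1.0 for _, ok in segments if ok)

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    normalizing each rollout's reward against its sampled group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: a group of 4 rollouts of a 3-primitive manipulation task.
rollouts = [
    [("reach", True), ("grasp", True), ("place", True)],
    [("reach", True), ("grasp", True), ("place", False)],
    [("reach", True), ("grasp", False), ("place", False)],
    [("reach", False), ("grasp", False), ("place", False)],
]
rewards = [primitive_segmented_reward(r) for r in rollouts]  # [3.0, 2.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)
```

Because rewards accrue per primitive rather than only at episode end, partially successful rollouts still receive graded credit, which is what lets the policy explore beyond the expert demonstrations.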

Main Results

NS-VLA achieves state-of-the-art performance across three benchmark settings:

| Method | Params | LIBERO (Full) | LIBERO (1-shot) | LIBERO-Plus |
|---|---|---|---|---|
| OpenVLA | 7B | 76.5 | 35.7 | 15.6 |
| OpenVLA-OFT | 7B | 97.1 | 48.9 | 69.6 |
| π₀ | 3B | 94.2 | 37.4 | 53.6 |
| UniVLA | 7B | 95.2 | 55.1 | 42.9 |
| VLA-Adapter | 0.5B | 97.3 | 65.3 | 58.9 |
| **NS-VLA (Ours)** | 2B | **98.6** | **69.1** | **79.4** |

Citation

If you find our work useful, please consider citing:

@article{zhu2026nsvla,
  title={NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models},
  author={Zhu, Ziyue and Wu, Shangyang and Zhao, Shuai and Zhao, Zhiqiu and Li, Shengjie and Wang, Yi and Li, Fang and Luo, Haoran},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}