Abstract
Effective generalization in robotic manipulation requires representations that capture invariant patterns of interaction across environments and tasks. We present a self-supervised framework for learning hierarchical manipulation concepts that encode these invariant patterns through cross-modal sensory correlations and multi-level temporal abstractions, without requiring human annotation. Our approach combines a cross-modal correlation network that identifies persistent patterns across sensory modalities with a multi-horizon predictor that organizes representations hierarchically across temporal scales. Manipulation concepts learned through this dual structure enable policies to focus on transferable relational patterns while maintaining awareness of both immediate actions and longer-term goals. Empirical evaluation across simulated benchmarks and real-world deployments shows that concept-enhanced policies achieve significant performance improvements. Analysis reveals that the learned concepts resemble human-interpretable manipulation primitives despite receiving no semantic supervision. This work advances the understanding of representation learning for manipulation and provides a practical approach to enhancing robotic performance in complex scenarios.
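For concreteness, the sketch below illustrates the two self-supervised signals described above (cross-modal correlation between sensory streams and multi-horizon prediction of future latents) in PyTorch. Module names, dimensions, horizons, and loss weights are illustrative assumptions rather than the paper's implementation.

```python
# Minimal PyTorch sketch of the two self-supervised signals described in the abstract:
# (1) cross-modal correlation between per-modality encodings, and
# (2) multi-horizon prediction of future concept latents.
# All module names, dimensions, horizons, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptEncoder(nn.Module):
    def __init__(self, vision_dim=512, proprio_dim=32, latent_dim=128, horizons=(1, 8, 32)):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, latent_dim)      # per-modality projections
        self.proprio_proj = nn.Linear(proprio_dim, latent_dim)
        self.fuse = nn.GRU(2 * latent_dim, latent_dim, batch_first=True)  # temporal fusion
        # one predictor head per temporal horizon (coarse-to-fine hierarchy)
        self.heads = nn.ModuleDict({str(h): nn.Linear(latent_dim, latent_dim) for h in horizons})
        self.horizons = horizons

    def forward(self, vision, proprio):
        v = self.vision_proj(vision)       # (B, T, D)
        p = self.proprio_proj(proprio)     # (B, T, D)
        z, _ = self.fuse(torch.cat([v, p], dim=-1))
        return v, p, z                     # modality features and fused concept latents

def cross_modal_loss(v, p, temperature=0.1):
    """InfoNCE-style alignment: matching time steps across modalities are positives."""
    v = F.normalize(v.flatten(0, 1), dim=-1)
    p = F.normalize(p.flatten(0, 1), dim=-1)
    logits = v @ p.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)

def multi_horizon_loss(model, z):
    """Predict the concept latent h steps ahead, for each horizon h."""
    loss = 0.0
    for h in model.horizons:
        if z.size(1) <= h:
            continue
        pred = model.heads[str(h)](z[:, :-h])               # predictions made at time t
        loss = loss + F.mse_loss(pred, z[:, h:].detach())   # targets taken at time t + h
    return loss

# usage with random stand-in data
model = ConceptEncoder()
vision, proprio = torch.randn(4, 64, 512), torch.randn(4, 64, 32)
v, p, z = model(vision, proprio)
loss = cross_modal_loss(v, p) + multi_horizon_loss(model, z)
loss.backward()
```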
Hierarchical Manipulation Concepts Improve Generalization
The manipulation concepts are discovered on a simple cup-cleaning task (placing the cup into the container) using only two cup-and-container color combinations, and are then used to regularize ACT policy training on the same task.
The policy enhanced with hierarchical manipulation concepts can handle environmental conditions that differ from the original training scenarios.
Generalization scenarios shown: unseen color combinations (blue cup with pink container; yellow cup with green container), an unseen container, unseen objects, an obstacle blocking the cup, and lighting changes from a flashlight.
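The concept regularization described above can be pictured with the minimal sketch below: a frozen concept encoder supplies target latents, and an ACT-style policy is trained with its usual imitation loss plus an auxiliary head that is pulled toward those latents. The policy class, feature sizes, and the 0.1 auxiliary weight are illustrative assumptions rather than the exact training recipe.

```python
# Hypothetical sketch: using frozen concept latents as an auxiliary regularizer while
# training an ACT-style policy. Names, sizes, and the 0.1 weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyACTPolicy(nn.Module):
    """Stand-in for an ACT-style policy that predicts an action chunk from observations."""
    def __init__(self, obs_dim=160, act_dim=7, chunk=16, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, act_dim * chunk)
        self.concept_head = nn.Linear(feat_dim, 128)   # predicts the concept latent
        self.chunk, self.act_dim = chunk, act_dim

    def forward(self, obs):
        feat = self.backbone(obs)
        actions = self.action_head(feat).view(-1, self.chunk, self.act_dim)
        return actions, self.concept_head(feat)

def training_step(policy, optimizer, obs, target_actions, concept_latent, aux_weight=0.1):
    """One step of behavior cloning plus a concept-consistency auxiliary loss."""
    pred_actions, pred_concept = policy(obs)
    bc_loss = F.l1_loss(pred_actions, target_actions)               # imitation loss
    aux_loss = F.mse_loss(pred_concept, concept_latent.detach())    # match frozen concept encoder
    loss = bc_loss + aux_weight * aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with random stand-in tensors (batch of 8 demo snippets)
policy = TinyACTPolicy()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
obs = torch.randn(8, 160)
target_actions = torch.randn(8, 16, 7)
concept_latent = torch.randn(8, 128)   # produced offline by the frozen concept encoder
training_step(policy, opt, obs, target_actions, concept_latent)
```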
In some situations, policies without manipulation concepts fail completely, while our approach successfully handles these challenging scenarios.
Scenarios shown: a barrier blocking the path and dual cup grasping, each comparing ours (with manipulation concepts) against the baseline (without manipulation concepts).
Real-World Experiment Success Rates (%)
| Method | Unseen Placements | Unseen Color Combination | Unseen Objects | Obstacle Blocking the Cup | Barrier Blocking Path | Dual Cup Grasping |
|---|---|---|---|---|---|---|
| Without Manipulation Concept | 53.3 | 46.7 | 40.0 | 20.0 | 0.0 | 0.0 |
| With Manipulation Concept | 73.3 | 60.0 | 53.3 | 33.3 | 20.0 | 13.3 |
Hierarchical Manipulation Concepts Improve VLA Fine-Tuning Data Efficiency
Our manipulation concept framework significantly improves data efficiency when integrated with vision-language-action (VLA) models. Using only 50% of the LIBERO-10 training data, our concept-enhanced approach achieves roughly 9% higher performance than the baseline OpenVLA-OFT trained with standard action supervision alone, demonstrating that hierarchical manipulation concepts enable more data-efficient learning.
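A minimal sketch of this setup is shown below, assuming the concept term is simply added to the standard action loss while fine-tuning on a fixed 50% subset of demonstrations. The helper names and the 0.1 weight are placeholders for illustration, not OpenVLA-OFT's actual API.

```python
# Illustrative sketch of the data-efficiency setup: fine-tune on a fixed 50% subset of
# LIBERO-10 demonstrations, with standard action supervision plus a concept auxiliary term.
# Helper names and the 0.1 weight are placeholder assumptions, not OpenVLA-OFT's API.
import random

def subsample_demos(demos, fraction=0.5, seed=0):
    """Return a fixed random fraction of the demonstration list."""
    rng = random.Random(seed)
    demos = list(demos)
    rng.shuffle(demos)
    return demos[: int(len(demos) * fraction)]

def combined_loss(action_loss, concept_loss, aux_weight=0.1):
    """Standard action supervision plus the hierarchical-concept regularizer."""
    return action_loss + aux_weight * concept_loss

# usage: keep half of 500 hypothetical demos and combine per-batch loss terms
demos = [f"demo_{i:03d}" for i in range(500)]
train_demos = subsample_demos(demos, fraction=0.5)
print(len(train_demos), combined_loss(action_loss=0.42, concept_loss=0.10))
```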
Simulation Results
| L90-90 | InfoCon (task) | InfoCon (motion) | XSkill | RPT | All | Next | CLIP | DINOv2 | DecisionNCE | Plain | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ACT | 66.5 (0.8) | 73.4 (0.8) | 68.8 (0.8) | 64.1 (2.0) | 68.0 (0.4) | 63.8 (0.5) | 71.9 (0.3) | 69.0 (0.1) | 66.8 (0.8) | 46.6 (1.9) | 74.8 (0.8) |
| DP | 78.2 (0.6) | 87.7 (0.6) | 84.3 (0.1) | 81.5 (0.5) | 82.6 (0.1) | 80.7 (0.9) | 79.4 (0.1) | 75.7 (0.8) | 82.7 (0.6) | 75.1 (0.6) | 89.6 (0.6) |
| L90-L | InfoCon (task) | InfoCon (motion) | XSkill | RPT | All | Next | CLIP | DINOv2 | DecisionNCE | Plain | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ACT | 55.5 (0.9) | 55.0 (1.0) | 59.0 (1.0) | 55.5 (0.9) | 55.0 (1.0) | 51.0 (1.0) | 55.0 (1.0) | 53.0 (1.0) | 49.3 (0.9) | 54.0 (0.9) | 63.0 (1.0) |
| DP | 75.0 (1.0) | 73.0 (1.0) | 61.3 (0.9) | 79.3 (0.9) | 83.0 (1.0) | 67.0 (1.0) | 63.0 (1.0) | 58.7 (0.9) | 52.7 (0.9) | 34.1 (1.1) | 89.0 (1.0) |
| L90-G | InfoCon (task) | InfoCon (motion) | XSkill | RPT | All | Next | CLIP | DINOv2 | DecisionNCE | Plain | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ACT | 67.0 (1.0) | 77.0 (1.0) | 75.0 (1.0) | 69.0 (1.0) | 71.0 (1.0) | 77.0 (1.0) | 77.3 (0.9) | 70.0 (0.9) | 75.0 (0.5) | 57.0 (1.0) | 81.0 (1.0) |
| DP | 92.7 (0.9) | 93.0 (1.0) | 91.5 (0.9) | 91.0 (1.0) | 91.3 (0.9) | 92.0 (0.9) | 91.0 (0.7) | 92.0 (0.8) | 93.0 (1.0) | 90.7 (0.9) | 95.7 (0.7) |
Hierarchical Task Decomposition via Manipulation Concept Latents (LIBERO)
Hierarchical Task Decomposition via Manipulation Concept Latents (BridgeData V2)
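The sketch below gives one simplified way such a decomposition could be read out of the concept latents: segment a trajectory wherever consecutive latents change sharply, with a looser threshold for coarse task-level boundaries and a tighter one for fine motion-level boundaries. The change-point rule and thresholds are illustrative assumptions, not the exact procedure behind these visualizations.

```python
# Simplified illustration (the thresholds and change-point rule are assumptions):
# segment a trajectory where consecutive concept latents change sharply, with a looser
# threshold for coarse task-level breaks and a tighter one for fine motion-level breaks.
import numpy as np

def segment_by_latent_change(latents, threshold):
    """Return indices where consecutive concept latents differ by more than `threshold`."""
    deltas = np.linalg.norm(np.diff(latents, axis=0), axis=1)
    return [int(i) + 1 for i in np.where(deltas > threshold)[0]]

def hierarchical_decomposition(latents, coarse_thresh=2.0, fine_thresh=0.8):
    """Coarse (task-level) and fine (motion-level) boundaries; fine boundaries refine coarse ones."""
    coarse = segment_by_latent_change(latents, coarse_thresh)
    fine = segment_by_latent_change(latents, fine_thresh)
    return {"task_boundaries": coarse, "motion_boundaries": sorted(set(fine) | set(coarse))}

# usage with synthetic latents containing one large and one small shift
latents = np.concatenate([np.zeros((40, 8)), np.ones((40, 8)) * 3.0, np.ones((40, 8)) * 3.5])
print(hierarchical_decomposition(latents))
```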
BibTeX
@inproceedings{liu2025textithimacon,
  title={$\textit{HiMaCon:}$ Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data},
  author={Ruizhe Liu and Pei Zhou and Qian Luo and Li Sun and Jun CEN and Yibing Song and Yanchao Yang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=2aIoEG2Hwz}
}