HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data

1HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
2Department of Electrical and Electronic Engineering, The University of Hong Kong
3DAMO Academy, Alibaba Group
4Transcengram

NeurIPS 2025


We train a neural network to learn Hierarchical Manipulation Concept Latents from demonstrations in a fully self-supervised manner. By adjusting a similarity threshold between latents, we can segment demonstrations into subgoals at different levels of granularity—from high-level processes to finer-grained steps—all without requiring any human labels.
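As a rough sketch of how this thresholding can be applied (assuming L2-normalized per-frame concept latents and cosine similarity as the boundary criterion; the function and parameter names below are illustrative, not the exact procedure used in the paper):

import numpy as np

def segment_by_similarity(latents, threshold):
    """Split one demonstration into contiguous subgoal segments.

    latents: (T, D) array of per-frame concept latents, assumed L2-normalized.
    threshold: cosine-similarity cutoff; a lower threshold merges frames into
    coarser phases, a higher threshold yields finer-grained steps.
    """
    segments, start = [], 0
    for t in range(1, len(latents)):
        sim = float(latents[t] @ latents[t - 1])  # cosine similarity of consecutive latents
        if sim < threshold:                       # concept changed: close the current segment
            segments.append((start, t))
            start = t
    segments.append((start, len(latents)))
    return segments

# Sweeping the threshold (e.g., 0.5, 0.7, 0.9) yields the hierarchy described above:
# coarse high-level processes at low thresholds, finer-grained steps at high ones.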

Abstract

Effective generalization in robotic manipulation requires representations that capture invariant patterns of interaction across environments and tasks. We present a self-supervised framework for learning hierarchical manipulation concepts that encode these invariant patterns through cross-modal sensory correlations and multi-level temporal abstractions, without requiring human annotation. Our approach combines a cross-modal correlation network that identifies persistent patterns across sensory modalities with a multi-horizon predictor that organizes representations hierarchically across temporal scales. Manipulation concepts learned through this dual structure enable policies to focus on transferable relational patterns while maintaining awareness of both immediate actions and longer-term goals. Empirical evaluation across simulated benchmarks and real-world deployments demonstrates significant performance improvements with our concept-enhanced policies. Analysis reveals that the learned concepts resemble human-interpretable manipulation primitives despite receiving no semantic supervision. This work advances the understanding of representation learning for manipulation and provides a practical approach to enhancing robotic performance in complex scenarios.
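To make the two components concrete, here is a hedged sketch of how the training signals described above could be instantiated: an InfoNCE-style cross-modal alignment term plus per-horizon latent prediction heads. The module names, loss forms, and temperature are assumptions for illustration rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def concept_training_losses(vision_feat, proprio_feat, concept,
                            future_concepts, predictor_heads, temperature=0.1):
    """Sketch of the two self-supervised signals described in the abstract.

    vision_feat, proprio_feat: (B, D) same-timestep features from two modalities.
    concept: (B, D) concept latents at the current timesteps.
    future_concepts: dict {horizon: (B, D) latents `horizon` steps ahead}.
    predictor_heads: dict {horizon: nn.Module that predicts that horizon}.
    """
    # 1) Cross-modal correlation: align same-timestep features across modalities
    #    (InfoNCE over the batch), so that only patterns persisting across
    #    sensors shape the representation.
    v = F.normalize(vision_feat, dim=-1)
    p = F.normalize(proprio_feat, dim=-1)
    logits = v @ p.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    cross_modal = F.cross_entropy(logits, targets)

    # 2) Multi-horizon prediction: separate heads predict the concept latent at
    #    several temporal offsets, encouraging a hierarchy that spans short-term
    #    motions and longer-term goals.
    prediction = sum(
        F.mse_loss(predictor_heads[h](concept), future_concepts[h].detach())
        for h in predictor_heads
    )
    return cross_modal + prediction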

Hierarchical Manipulation Concepts Improve Generalization

The manipulation concepts are discovered on a simple cup-cleaning task (placing the cup into the container) with only these two color combinations, and then used to regularize ACT policy training on the same task.
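One plausible way to implement this regularization is an auxiliary head on the policy's intermediate features, trained to match the frozen concept encoder's latent for the same observation and added on top of the usual imitation loss. The loss form and weighting below are illustrative assumptions, not the exact mechanism of the paper.

import torch.nn as nn
import torch.nn.functional as F

class ConceptRegularizedPolicyLoss(nn.Module):
    """Adds a concept-prediction term to a policy's imitation objective."""

    def __init__(self, embed_dim, concept_dim, weight=0.1):
        super().__init__()
        self.head = nn.Linear(embed_dim, concept_dim)  # maps policy features to concept space
        self.weight = weight                           # assumed trade-off coefficient

    def forward(self, action_loss, policy_embedding, concept_target):
        # action_loss: the policy's usual imitation loss (e.g., ACT's loss on action chunks).
        # concept_target: latent from the frozen concept encoder, treated as a constant.
        aux = F.mse_loss(self.head(policy_embedding), concept_target.detach())
        return action_loss + self.weight * aux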



The policy enhanced with hierarchical manipulation concepts can handle environmental conditions that differ from the original training scenarios.


Unseen color combination (blue cup, pink container); unseen container; obstacle blocking the cup; flashlight

Unseen color combination (yellow cup, green container); unseen objects; obstacle blocking the cup; flashlight



In some situations, policies without manipulation concepts fail completely, while our approach successfully handles these challenging scenarios.


Barrier blocking path: ours (with manipulation concepts) vs. baseline (without manipulation concepts)

Dual cup grasping: ours (with manipulation concepts) vs. baseline (without manipulation concepts)



Real-World Experiment Success Rates

Scenario                      Without Manipulation Concept (%)   With Manipulation Concept (%)
Unseen Placements             53.3                               73.3
Unseen Color Combination      46.7                               60.0
Unseen Objects                40.0                               53.3
Obstacle blocking the cup     20.0                               33.3
Barrier blocking path          0.0                               20.0
Dual cup grasping              0.0                               13.3

Hierarchical Manipulation Concepts Improve VLA Fine-Tuning Data Efficiency


Our manipulation concept framework significantly improves data efficiency when integrated with vision-language-action models. Using only 50% of the LIBERO-10 training data, our concept-enhanced approach achieves ~9% higher performance than the baseline OpenVLA-OFT with standard action supervision, demonstrating that hierarchical manipulation concepts enable more efficient learning from less data.
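A minimal sketch of this data-efficiency setup, assuming it amounts to a reproducible 50% subsample of demonstrations plus a weighted concept term alongside the standard action loss; the function names, mixing weight, and data handling are placeholders rather than the actual OpenVLA-OFT integration.

import random

def subsample_demos(demo_ids, fraction=0.5, seed=0):
    """Keep a reproducible fraction of the demonstrations (e.g., 50% of LIBERO-10)."""
    rng = random.Random(seed)
    ids = sorted(demo_ids)
    rng.shuffle(ids)
    return ids[:max(1, int(len(ids) * fraction))]

def finetune_loss(action_loss, concept_loss, weight=0.1):
    """Combine the VLA's action-supervision loss with the concept term."""
    return action_loss + weight * concept_loss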

Simulation Results

Evaluation of manipulation concept discovery methods across different task settings.
Success rates (%) of ACT and Diffusion Policy (DP) models when enhanced with manipulation concepts from various discovery methods. All concept encoders were trained only on LIBERO-90 and evaluated on the original tasks (L90-90), novel long-horizon compositions (L90-L), and entirely new environments (L90-G). Values in parentheses are standard deviations across 4 seeds; bold and underlined values indicate the best and second-best results, respectively.
L90-90: InfoCon, XSkill, RPT, All task, Next motion, CLIP, DINOv2, DecisionNCE, Plain, Ours
ACT: 66.5 (0.8), 73.4 (0.8), 68.8 (0.8), 64.1 (2.0), 68.0 (0.4), 63.8 (0.5), 71.9 (0.3), 69.0 (0.1), 66.8 (0.8), 46.6 (1.9), 74.8 (0.8)
DP: 78.2 (0.6), 87.7 (0.6), 84.3 (0.1), 81.5 (0.5), 82.6 (0.1), 80.7 (0.9), 79.4 (0.1), 75.7 (0.8), 82.7 (0.6), 75.1 (0.6), 89.6 (0.6)

L90-L: InfoCon, XSkill, RPT, All task, Next motion, CLIP, DINOv2, DecisionNCE, Plain, Ours
ACT: 55.5 (0.9), 55.0 (1.0), 59.0 (1.0), 55.5 (0.9), 55.0 (1.0), 51.0 (1.0), 55.0 (1.0), 53.0 (1.0), 49.3 (0.9), 54.0 (0.9), 63.0 (1.0)
DP: 75.0 (1.0), 73.0 (1.0), 61.3 (0.9), 79.3 (0.9), 83.0 (1.0), 67.0 (1.0), 63.0 (1.0), 58.7 (0.9), 52.7 (0.9), 34.1 (1.1), 89.0 (1.0)

L90-G: InfoCon, XSkill, RPT, All task, Next motion, CLIP, DINOv2, DecisionNCE, Plain, Ours
ACT: 67.0 (1.0), 77.0 (1.0), 75.0 (1.0), 69.0 (1.0), 71.0 (1.0), 77.0 (1.0), 77.3 (0.9), 70.0 (0.9), 75.0 (0.5), 57.0 (1.0), 81.0 (1.0)
DP: 92.7 (0.9), 93.0 (1.0), 91.5 (0.9), 91.0 (1.0), 91.3 (0.9), 92.0 (0.9), 91.0 (0.7), 92.0 (0.8), 93.0 (1.0), 90.7 (0.9), 95.7 (0.7)

Hierarchical Task Decomposition via Manipulation Concept Latents (LIBERO)

Hierarchical Task Decomposition via Manipulation Concept Latents (BridgeData V2)

BibTeX

@inproceedings{liu2025textithimacon,
  title={$\textit{HiMaCon:}$ Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data},
  author={Ruizhe Liu and Pei Zhou and Qian Luo and Li Sun and Jun CEN and Yibing Song and Yanchao Yang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=2aIoEG2Hwz}
}