Comparison of 2D and 3D Modalities: CLAR leverages 3D point clouds to resolve multi-view ambiguity.
The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning is ill-suited for the fine-grained details required for manipulation tasks.
To address these challenges, we propose CLAR, a novel 3D pre-training framework that combines global understanding with fine-grained local alignment. CLAR unifies MAE with global cross-modal contrastive learning and introduces an adaptive alignment mechanism that uses deformable attention to enforce precise 3D-to-2D correspondences.
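The combination described above (masked reconstruction, global cross-modal contrast, and a local alignment term) can be sketched as a weighted sum of three losses. This is a minimal illustration, not the authors' implementation: the use of Chamfer distance for reconstruction, symmetric InfoNCE for the global contrastive term, the temperature of 0.07, the loss weights, and all function names are assumptions made for the example.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two lists of 3D points.

    Assumed here as the masked-reconstruction loss; the source does not
    specify which reconstruction loss CLAR uses.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


def info_nce(point_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE over paired global embeddings.

    Row i of point_embs and row i of image_embs form a positive pair;
    every other row in the batch acts as a negative. The temperature is
    a hypothetical default.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    p = [normalize(v) for v in point_embs]
    q = [normalize(v) for v in image_embs]
    logits = [
        [sum(a * b for a, b in zip(pi, qj)) / temperature for qj in q]
        for pi in p
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def total_loss(rec, global_con, local_align,
               w_rec=1.0, w_global=1.0, w_local=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return w_rec * rec + w_global * global_con + w_local * local_align
```

In this sketch the local alignment term is passed in as a precomputed scalar, since its form depends on how point features are matched to image features.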
(a) The CLAR Pre-training Framework: We enhance spatial understanding via MAE and semantic comprehension through contrastive learning. To capture fine-grained local details, we supplement the global contrastive loss with an adaptive local feature alignment mechanism based on deformable attention.
(b) Resolving Contextual Mismatch: Traditional point cloud cropping removes background context, leading to a feature discrepancy between the two modalities. Our adaptive local alignment strategy ensures that learning focuses on meaningful information shared between modalities.
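One way to picture the local 3D-to-2D alignment is as deformable sampling: a 3D point is projected into the image, a few offset locations around the projection are sampled from the 2D feature map, and the samples are combined with attention weights before being compared with the point feature. The sketch below is an assumption-laden illustration rather than CLAR's actual mechanism. The pinhole camera model, the cosine-based loss, and every function name are hypothetical, and the offsets and attention logits are passed in directly, whereas a deformable attention layer would normally predict them from the query feature.

```python
import math


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(feature_map, u, v):
    """Bilinearly sample an H x W x C nested-list feature map at (u, v).

    Coordinates are clamped to the image border.
    """
    h, w = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    channels = len(feature_map[0][0])
    return [
        (1 - du) * (1 - dv) * feature_map[v0][u0][c]
        + du * (1 - dv) * feature_map[v0][u1][c]
        + (1 - du) * dv * feature_map[v1][u0][c]
        + du * dv * feature_map[v1][u1][c]
        for c in range(channels)
    ]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_sample(feature_map, ref_uv, offsets, attn_logits):
    """Attention-weighted sum of features sampled at ref_uv + each offset.

    offsets and attn_logits are supplied by the caller in this sketch;
    a learned layer would predict them in a real model.
    """
    weights = softmax(attn_logits)
    channels = len(feature_map[0][0])
    out = [0.0] * channels
    for (du, dv), w in zip(offsets, weights):
        feat = bilinear_sample(feature_map, ref_uv[0] + du, ref_uv[1] + dv)
        out = [o + w * f for o, f in zip(out, feat)]
    return out


def cosine_alignment_loss(point_feat, image_feat):
    """1 - cosine similarity; an assumed stand-in for the local alignment loss."""
    dot = sum(a * b for a, b in zip(point_feat, image_feat))
    norm_p = math.sqrt(sum(a * a for a in point_feat)) or 1.0
    norm_i = math.sqrt(sum(b * b for b in image_feat)) or 1.0
    return 1.0 - dot / (norm_p * norm_i)
```

Because the sampling locations can move away from the exact projection, the image-side feature can draw on nearby context that a tightly cropped point cloud no longer contains, which is the intuition behind focusing on shared information.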
CLAR achieves state-of-the-art success rates (82.6% on MetaWorld and 82.0% on RLBench), significantly outperforming 2D baselines and existing 3D pre-training methods.
Qualitative results across five real-world tasks.
83.0% Mean Success Rate
CLAR significantly outperforms baselines in challenging real-world tasks. Its unified 3D pre-training allows it to better comprehend object geometries and spatial relations, resolving the modality mismatch common in previous works.
@article{cui2025clar,
title={CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment},
author={Cui, Wenbo and Zhao, Chengyang and Chen, Yuhui and Li, Haoran and Zhang, Zhizheng and Zhao, Dongbin and Wang, He},
journal={arXiv preprint arXiv:2507.08262},
year={2025}
}