CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment

Wenbo Cui* 1,2,3, Chengyang Zhao* 4, Yuhui Chen1,2, Haoran Li1,2, Zhizheng Zhang5, Dongbin Zhao1,2, He Wang† 5,6
1CASIA, 2UCAS, 3BAAI, 4CMU, 5Galbot, 6Peking University
*Indicates Equal Contribution, †Corresponding Author

Abstract

[Figure: Comparison of 2D and 3D modalities. CLAR leverages 3D point clouds to resolve multi-view ambiguity.]

The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning is ill-suited for the fine-grained details required for manipulation tasks.

To address this trade-off, we propose CLAR, a novel 3D pre-training framework that combines global understanding with fine-grained local alignment. CLAR unifies MAE with global cross-modal contrastive learning and introduces an adaptive alignment mechanism that uses deformable attention to enforce precise 3D-to-2D correspondences.
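As a rough illustration of how a masked-reconstruction term and a global cross-modal contrastive term can be combined into a single objective, the NumPy sketch below pairs a symmetric Chamfer distance with a symmetric InfoNCE loss. The specific loss forms, the function names, the temperature, and the weights `w_rec` and `w_con` are assumptions chosen for illustration; they are not taken from the CLAR paper.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    # Nearest-neighbour distance in both directions, averaged over points.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def info_nce(z3d, z2d, temperature=0.07):
    """Symmetric InfoNCE over paired global embeddings of shape (B, D)."""
    z3d = z3d / np.linalg.norm(z3d, axis=1, keepdims=True)
    z2d = z2d / np.linalg.norm(z2d, axis=1, keepdims=True)
    logits = z3d @ z2d.T / temperature  # (B, B); the diagonal holds matched pairs

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_prob).mean()

    # Average the 3D-to-2D and 2D-to-3D directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def pretraining_loss(pred_pts, target_pts, z3d, z2d, w_rec=1.0, w_con=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_rec * chamfer_distance(pred_pts, target_pts)
            + w_con * info_nce(z3d, z2d))
```

In this sketch the reconstruction term depends only on the 3D branch, while the contrastive term pulls each point cloud embedding toward the embedding of its paired image and away from the other images in the batch.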

Methodology

[Figure: CLAR pipeline]

(a) The CLAR Pre-training Framework: We enhance spatial understanding via MAE and semantic comprehension through contrastive learning. To capture fine-grained local details, we supplement the global contrastive loss with an adaptive local feature alignment mechanism based on deformable attention.

(b) Resolving Contextual Mismatch: Traditional point cloud cropping removes background context that the 2D image still contains, leading to a feature discrepancy between the two modalities. Our adaptive local alignment strategy ensures that learning focuses on meaningful information shared between modalities.
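To make the idea of deformable-attention-based local alignment more concrete, the NumPy sketch below gathers a local 2D feature for each 3D token by sampling the image feature map at a few offset locations around a reference pixel and mixing the samples with softmax weights. It is a generic illustration of deformable sampling, not the authors' implementation: the function names, the tensor shapes, and the assumption that reference pixels, offsets, and attention logits are already available (for example, from projecting token centres into the image and from small prediction heads) are all hypothetical.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Sample feat (H, W, C) at continuous pixel coordinates xy (K, 2), given as (x, y)."""
    H, W, _ = feat.shape
    x = np.clip(xy[:, 0], 0, W - 1)
    y = np.clip(xy[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def deformable_local_feature(feat, ref_xy, offsets, attn_logits):
    """
    feat:        (H, W, C) 2D feature map
    ref_xy:      (N, 2) reference pixel (x, y) for each 3D token (assumed given)
    offsets:     (N, K, 2) sampling offsets per token (assumed predicted)
    attn_logits: (N, K) unnormalised attention scores (assumed predicted)
    returns:     (N, C) one aggregated 2D feature per 3D token
    """
    n_tokens, n_samples, _ = offsets.shape
    locations = ref_xy[:, None, :] + offsets  # (N, K, 2)
    sampled = bilinear_sample(feat, locations.reshape(-1, 2))
    sampled = sampled.reshape(n_tokens, n_samples, -1)
    # Softmax over the K sampling locations of each token.
    weights = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * sampled).sum(axis=1)
```

Because the sampling locations are learned rather than fixed, a token whose cropped 3D neighbourhood has no counterpart at the reference pixel can, in principle, attend to nearby image regions that do carry shared content. The aggregated features could then be compared with the 3D token features using a local contrastive or regression loss, which this sketch leaves out.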

Experimental Results

Simulation Benchmarks

[Figure: Simulation results]

CLAR achieves state-of-the-art success rates (82.6% on MetaWorld and 82.0% on RLBench), significantly outperforming 2D baselines and existing 3D pre-training methods.


Real-World Robot Experiments (Franka Emika)

[Figure: Qualitative results across five real-world tasks.]

[Figure: Success rate chart]

83.0% Mean Success Rate

CLAR significantly outperforms baselines in challenging real-world tasks. Its unified 3D pre-training allows it to better comprehend object geometries and spatial relations, resolving the modality mismatch common in previous works.

BibTeX

@article{cui2025clar,
  title={CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment},
  author={Cui, Wenbo and Zhao, Chengyang and Chen, Yuhui and Li, Haoran and Zhang, Zhizheng and Zhao, Dongbin and Wang, He},
  journal={arXiv preprint arXiv:2507.08262},
  year={2025}
}