Abstract: Point cloud self-supervised representation learning pre-trains on unlabeled data, exploring the structural relationships of 3D topological geometric space and capturing feature representations that transfer to downstream tasks such as point cloud classification, segmentation, and object detection. To enhance the generalization and robustness of pretrained models, this study proposes a multi-modal self-supervised method for learning point cloud representations. The method is based on bidirectional fit mask reconstruction and comprises three main components: (1) A “bad teacher” model, guided by an inverse density scale, employs a bidirectional fit strategy that combines an inverse density noise representation with a global feature representation to accelerate convergence of the masked region toward its ground truth. (2) A StyleGAN-based auxiliary point cloud generation model, grounded in local geometric information, generates stylized point clouds and fuses them with the mask reconstruction results under threshold constraints, mitigating the adverse effect of noise on representation learning during reconstruction. (3) A multi-modal teacher model enhances the diversity of the 3D feature space and prevents the collapse of modal information, relying on a triple feature contrast loss to fully extract the latent information contained in the point cloud-image-text sample space. The proposed method is evaluated by fine-tuning on the ModelNet, ScanObjectNN, and ShapeNet datasets. Experimental results demonstrate that the pretrained model achieves state-of-the-art performance on various point cloud recognition tasks, including point cloud classification, linear support vector machine classification, few-shot classification, zero-shot classification, and part segmentation.
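The abstract names a triple feature contrast loss over the point cloud-image-text sample space but does not specify its form. The sketch below is a minimal illustration, assuming CLIP-style pairwise InfoNCE terms summed over the three modality pairs; the names `info_nce`, `triple_contrast_loss`, and the embedding arguments `z_pc`, `z_img`, `z_txt` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings (shape: B x D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def triple_contrast_loss(z_pc: torch.Tensor, z_img: torch.Tensor, z_txt: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Assumed form: sum of pairwise contrastive terms over point cloud, image, and text embeddings."""
    return (info_nce(z_pc, z_img, temperature) +
            info_nce(z_pc, z_txt, temperature) +
            info_nce(z_img, z_txt, temperature))
```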