Abstract: RGB-D salient object detection has received increasing attention due to its effectiveness and the ease of capturing depth cues. Existing work usually focuses on learning shared representations through various fusion strategies, and few approaches explicitly consider how to preserve the modality-specific characteristics of RGB and depth. In this paper, we propose a cross-modal fusion network that maintains the RGB and depth modalities for RGB-D salient object detection, improving detection performance by exploring both the shared information and the individual properties of the RGB and depth modalities. Specifically, an RGB modality network, a depth modality network, and a shared learning network are used to generate RGB and depth modality saliency prediction maps as well as a shared saliency prediction map. A cross-modal feature integration module is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer to integrate cross-level information. In addition, we propose a multi-modal feature aggregation module that integrates the modality-specific features from each individual decoder into the shared decoder, providing rich complementary multi-modal information to boost saliency detection performance. Furthermore, skip connections are used to combine hierarchical features between the encoder and decoder layers. Experiments against ten state-of-the-art methods on four benchmark datasets show that the proposed method outperforms them.
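To make the data flow described above concrete, the following is a minimal sketch, assuming PyTorch, of the three-branch layout: an RGB branch, a depth branch, and a shared branch whose encoder features are fused by a cross-modal integration step and whose decoder aggregates the modality-specific decoder features. All class names, the concat-plus-convolution fusion choices, and the single-layer encoders/decoders are illustrative placeholders rather than the paper's actual implementation; skip connections and multi-level decoding are omitted for brevity.

```python
# Illustrative sketch only: hypothetical module names and simplified fusion operations.
import torch
import torch.nn as nn


class CrossModalIntegration(nn.Module):
    """Fuses RGB and depth features for the shared branch (assumed fusion: concat + 1x1 conv)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        return self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))


class MultiModalAggregation(nn.Module):
    """Injects modality-specific decoder features into the shared decoder (assumed: concat + 3x3 conv)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, shared_feat, rgb_feat, depth_feat):
        return self.fuse(torch.cat([shared_feat, rgb_feat, depth_feat], dim=1))


class ThreeBranchSaliencyNet(nn.Module):
    """RGB branch, depth branch, and shared branch, each predicting a saliency map."""
    def __init__(self, channels=32):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())
        self.cmi = CrossModalIntegration(channels)
        self.rgb_dec = nn.Conv2d(channels, channels, 3, padding=1)
        self.depth_dec = nn.Conv2d(channels, channels, 3, padding=1)
        self.mma = MultiModalAggregation(channels)
        self.rgb_head = nn.Conv2d(channels, 1, 1)
        self.depth_head = nn.Conv2d(channels, 1, 1)
        self.shared_head = nn.Conv2d(channels, 1, 1)

    def forward(self, rgb, depth):
        fr = self.rgb_enc(rgb)        # RGB-specific encoder features
        fd = self.depth_enc(depth)    # depth-specific encoder features
        fs = self.cmi(fr, fd)         # shared features via cross-modal integration
        dr = self.rgb_dec(fr)         # RGB decoder features
        dd = self.depth_dec(fd)       # depth decoder features
        ds = self.mma(fs, dr, dd)     # shared decoder aggregates all modalities
        return self.rgb_head(dr), self.depth_head(dd), self.shared_head(ds)


if __name__ == "__main__":
    net = ThreeBranchSaliencyNet()
    rgb = torch.randn(1, 3, 64, 64)
    depth = torch.randn(1, 1, 64, 64)
    s_rgb, s_depth, s_shared = net(rgb, depth)
    print(s_rgb.shape, s_depth.shape, s_shared.shape)  # three saliency prediction maps
```

In this simplified view, the two modality-specific heads keep the RGB and depth representations separate, while the shared head benefits from both the cross-modal integration at the encoder side and the multi-modal aggregation at the decoder side.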