[Fig. 1: (a) Structure of each module in VCN. (c) The whole layers of the view module. (d) The whole layers of the fusion module.]
The View Convolution Network (VCN) architecture consists of three main modules: the dimension module, the view module, and the fusion module. Fig. 1(a) shows the structure of each module in VCN.
The dimension module is responsible for reducing the input data to a lower-dimensional representation. It achieves this with a combination of convolutional and fully connected layers. Fig. 1(b) illustrates the entire set of layers in the dimension module.
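The paper does not specify layer counts or sizes, so the following is a minimal NumPy sketch of the dimension module's forward pass under assumed shapes: one 3x3 convolution followed by a flatten and a fully connected projection to a low-dimensional code. All weights and dimensions here are illustrative stand-ins, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(x, w):
    # Valid 3x3 convolution over a single-channel feature map (illustrative only).
    h, wd = x.shape
    kh, kw = w.shape
    out = np.zeros((h - kh + 1, wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def dimension_module(x, conv_w, fc_w):
    # Convolution extracts spatial features; the FC layer projects them
    # down to a low-dimensional representation.
    feat = np.maximum(conv3x3(x, conv_w), 0)   # ReLU activation
    return fc_w @ feat.reshape(-1)             # flatten + fully connected

x = rng.standard_normal((8, 8))                # toy single-channel input
conv_w = rng.standard_normal((3, 3))
fc_w = rng.standard_normal((4, 36))            # 6x6 feature map -> 4-dim code
code = dimension_module(x, conv_w, fc_w)
print(code.shape)  # (4,)
```

A real implementation would stack several such conv/FC layers with learned weights; the point is only that the module maps a high-dimensional input to a compact code.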
The view module takes as input multiple views or modalities of an object, such as RGB images, depth maps, or thermal images. Each modality is processed independently through a set of convolutional layers before being concatenated into a single feature map. Fig. 1(c) displays all layers in the view module.
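The per-modality branching described above can be sketched as follows. For brevity, each convolutional branch is simplified to a single linear map with ReLU; the modality names, vector sizes, and weights are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def branch(x, w):
    # One branch per modality; a real view module would use conv layers here.
    return np.maximum(w @ x, 0)

# Hypothetical modalities, each already flattened to a 16-dim vector.
views = {name: rng.standard_normal(16) for name in ("rgb", "depth", "thermal")}
weights = {name: rng.standard_normal((8, 16)) for name in views}

# Process each view independently, then concatenate into one feature map.
features = [branch(views[n], weights[n]) for n in sorted(views)]
fused_view = np.concatenate(features)
print(fused_view.shape)  # (24,)
```

The key design point is that modalities share no weights before concatenation, so each branch can specialize to its input type.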
The fusion module combines the low-dimensional representations generated by the dimension module with the multi-modal features extracted from the view module to produce a final output. This is accomplished through several fully connected layers that learn to weight and combine information from each modality appropriately. Fig. 1(d) presents all layers in the fusion module.
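Assuming the same illustrative shapes as above, a minimal sketch of the fusion step might look like this: the low-dimensional code and the concatenated view features are joined and passed through two fully connected layers. The layer widths and random weights are hypothetical; in VCN these weights would be learned so that each modality is weighted appropriately.

```python
import numpy as np

rng = np.random.default_rng(2)

dim_code = rng.standard_normal(4)     # low-dim output of the dimension module
view_feat = rng.standard_normal(24)   # concatenated multi-modal view features

# Fusion: concatenate both representations, then apply FC layers that
# (after training) would learn how to weight and combine each modality.
w1 = rng.standard_normal((16, 28))    # 4 + 24 = 28 fused inputs
w2 = rng.standard_normal((10, 16))
hidden = np.maximum(w1 @ np.concatenate([dim_code, view_feat]), 0)
output = w2 @ hidden
print(output.shape)  # (10,)
```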
Overall, VCN provides an effective way to process multi-modal data by leveraging both low- and high-level features across different views or modalities of an object simultaneously.