The present invention relates to a network apparatus that uses a single indoor color image and its depth image to perform deep learning-based pixel-wise semantic segmentation. According to the present invention, the network apparatus comprises: a first residual network module including a plurality of stages, which are connected from an upper stage to a lower stage and each formed with a convolutional neural network (CNN), to extract feature information stage by stage from the color image; a second residual network module including a plurality of stages, which are connected from an upper stage to a lower stage and each formed with a CNN, to extract feature information stage by stage from the depth image; and a multimodal feature fusion network (MMFNet) module that fuses the feature information extracted at each stage of the first and second residual network modules. For each stage, the MMFNet module receives the color and depth feature information extracted at the corresponding stage of the first and second residual network modules and is formed by sequentially connecting: a convolution block that reduces the dimensions of the feature information in order to curb a rapid increase in the number of parameters; two residual convolution units that perform a nonlinear transformation in preparation for the combination; and a convolution block that adaptively combines feature information of different types and adjusts the scale of the feature values for addition. The scaled color feature information and the scaled depth feature information are then combined by addition. The present invention extracts and combines feature information of various dimensions at the same time, so that the feature information can be learned efficiently.
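The per-stage fusion described above can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation: it assumes 1x1 convolutions throughout (the abstract does not state kernel sizes), a ReLU nonlinearity inside each residual convolution unit, random untrained weights, and a feature map stored as a flat list of pixels. All names (`conv1x1`, `rcu`, `make_branch`, `branch_forward`, `mmf_block`) are hypothetical and chosen only for illustration.

```python
import random

def conv1x1(feat, weight):
    """1x1 convolution: one matrix-vector product per pixel.
    feat is a list of pixels (each a list of C_in values);
    weight is a C_out x C_in matrix given as a list of rows."""
    return [[sum(w * v for w, v in zip(row, px)) for row in weight] for px in feat]

def relu(feat):
    return [[max(0.0, v) for v in px] for px in feat]

def add(a, b):
    """Element-wise addition of two feature maps of equal shape."""
    return [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(a, b)]

def rcu(feat, w1, w2):
    """Residual convolution unit (assumed form): x + conv(relu(conv(relu(x))))."""
    out = conv1x1(relu(feat), w1)
    out = conv1x1(relu(out), w2)
    return add(feat, out)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_branch(c_in, c_mid, rng):
    """Random (untrained) weights for one modality branch."""
    return {
        "reduce": rand_matrix(c_mid, c_in, rng),   # dimension-reducing convolution
        "rcu1": (rand_matrix(c_mid, c_mid, rng), rand_matrix(c_mid, c_mid, rng)),
        "rcu2": (rand_matrix(c_mid, c_mid, rng), rand_matrix(c_mid, c_mid, rng)),
        "scale": rand_matrix(c_mid, c_mid, rng),   # scale-adjusting convolution
    }

def branch_forward(feat, params):
    """Reduce channels, apply two RCUs, then rescale for the addition."""
    x = conv1x1(feat, params["reduce"])
    x = rcu(x, *params["rcu1"])
    x = rcu(x, *params["rcu2"])
    return conv1x1(x, params["scale"])

def mmf_block(color_feat, depth_feat, color_params, depth_params):
    """Fuse one stage: process each modality separately, then add."""
    return add(branch_forward(color_feat, color_params),
               branch_forward(depth_feat, depth_params))

rng = random.Random(0)
c_in, c_mid, n_pixels = 8, 4, 6  # toy sizes, chosen arbitrarily
color = [[rng.random() for _ in range(c_in)] for _ in range(n_pixels)]
depth = [[rng.random() for _ in range(c_in)] for _ in range(n_pixels)]
color_params = make_branch(c_in, c_mid, rng)
depth_params = make_branch(c_in, c_mid, rng)

fused = mmf_block(color, depth, color_params, depth_params)
print(len(fused), len(fused[0]))  # 6 4
```

In a full network, one such block would be applied to the output of each corresponding stage of the two residual networks. The channel reduction from `c_in` to `c_mid` is the step that keeps the parameter count of the later layers small.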