Hierarchical label-guided human parsing

doi:10.7523/j.issn.2095-6134.2020.06.014

Abstract: Human parsing is a form of semantic segmentation that segments the different body parts of a person in an image, and it is an emerging task in computer vision. Compared with general objects, the human body is highly structured, but it varies widely in pose and is often occluded by clothing. In this paper we present a hierarchical label network (HLNet). First, fine-grained categories are merged into body parts at different granularities, yielding multiple parsing maps for each image. Next, a convolutional neural network with a pyramid feature extraction structure is trained under the supervision of these maps. Finally, the hierarchical features are fused to predict the final parsing result. Experiments on the LIP dataset show that the proposed algorithm achieves higher parsing accuracy and better segmentation performance than common semantic segmentation algorithms.
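The first step, merging fine categories into coarser body parts to obtain several parsing maps per image, can be illustrated with a minimal sketch. The label ids, the part names, and the two grouping tables below are hypothetical and chosen only for illustration; they are not the LIP label list or the grouping used in the paper, and the helper names `remap` and `build_hierarchy` are likewise assumptions.

```python
# Hypothetical fine-grained label ids (illustrative only, not the LIP labels).
FINE_LABELS = {
    0: "background", 1: "hat", 2: "hair", 3: "face", 4: "upper-clothes",
    5: "left-arm", 6: "right-arm", 7: "pants", 8: "left-leg", 9: "right-leg",
}

# Assumed grouping, fine id -> mid-level part id:
# 0 background, 1 head, 2 torso, 3 arms, 4 legs.
FINE_TO_MID = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4}

# Assumed grouping, mid-level id -> coarse id:
# 0 background, 1 upper body, 2 lower body.
MID_TO_COARSE = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}


def remap(label_map, table):
    """Replace every pixel label in a 2D label map using a lookup table."""
    return [[table[pixel] for pixel in row] for row in label_map]


def build_hierarchy(fine_map):
    """Derive coarser parsing maps from one fine-grained annotation."""
    mid_map = remap(fine_map, FINE_TO_MID)
    coarse_map = remap(mid_map, MID_TO_COARSE)
    return {"fine": fine_map, "mid": mid_map, "coarse": coarse_map}


if __name__ == "__main__":
    # A toy 3x3 "image" of fine-grained labels.
    fine = [
        [0, 1, 2],
        [5, 4, 6],
        [8, 7, 9],
    ]
    for level, label_map in build_hierarchy(fine).items():
        print(level, label_map)
```

Each derived map has the same spatial size as the original annotation, so in principle all levels could supervise the same network at once; how the supervision is attached to the pyramid features is not specified in this abstract.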

Key words: hierarchical labeling, convolutional neural networks (CNN), human parsing, semantic segmentation

CLC Number: TP391

HU Lina, GAO Shenghua. Hierarchical label-guided human parsing[J]. , 2020, 37(6): 820-827.
