A lip-reading recognition approach based on long short-term memory

doi:10.7523/j.issn.2095-6134.2018.01.015

Abstract

Visual speech information is an important carrier of conversation. However, it differs across speakers because of differences in lip appearance, background, and speaking style, even when the content of the conversation is the same. To address this variability, we propose a new approach to lip-reading recognition based on long short-term memory (LSTM). We compute the positions of lip landmarks, which describe the dynamics of the lip shape, and use them as the features of a lip-reading video; these features exhibit within-class consistency and between-class distinctiveness. We then use an LSTM to encode temporal information, so that it learns spatio-temporal features that are both discriminative and generalizable. We evaluate our approach on three public databases (GRID, MRIALC, and OuluVS) for lip-reading recognition of isolated words or phrases in speaker-independent experiments. On GRID and MRIALC, the accuracy of our approach is more than 30% higher than that of the conventional approach. On OuluVS, its accuracy is comparable to the state of the art. The experimental results indicate that our approach effectively addresses the variability of visual speech information.
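The pipeline described above (per-frame lip landmark positions fed to an LSTM) can be illustrated with a minimal sketch. It is not the authors' implementation: the centroid-and-scale normalization, the number of landmarks, the hidden size, the random untrained weights, and all function names (`normalize_landmarks`, `init_lstm`, `lstm_encode`, `softmax`) are assumptions made only for illustration. The code uses the Python standard library and runs a plain LSTM forward pass over a sequence of landmark feature vectors.

```python
import math
import random


def normalize_landmarks(points):
    """Center (x, y) landmarks on their centroid and divide by the mean
    distance to it, so the features ignore translation and scale.
    Returns a flat vector [x1, y1, x2, y2, ...]. (Assumed normalization.)"""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    scale = sum(math.hypot(x - cx, y - cy) for x, y in points) / n or 1.0
    feats = []
    for x, y in points:
        feats.extend(((x - cx) / scale, (y - cy) / scale))
    return feats


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_lstm(input_size, hidden_size, rng):
    """Small random weights for the four LSTM gates:
    i (input), f (forget), g (cell candidate), o (output)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: {"W": matrix(hidden_size, input_size + hidden_size),
               "b": [0.0] * hidden_size}
        for gate in "ifgo"
    }


def affine(layer, vec):
    """Compute W @ vec + b with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(layer["W"], layer["b"])]


def lstm_encode(params, sequence, hidden_size):
    """Run the LSTM over a sequence of feature vectors and return the
    final hidden state, which summarizes the whole video clip."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h  # concatenate current input and previous hidden state
        i = [sigmoid(v) for v in affine(params["i"], z)]
        f = [sigmoid(v) for v in affine(params["f"], z)]
        g = [math.tanh(v) for v in affine(params["g"], z)]
        o = [sigmoid(v) for v in affine(params["o"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def softmax(logits):
    """Turn class scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: 6 frames of 4 hypothetical lip landmarks, mouth opening over time.
rng = random.Random(0)
frames = [[(0.0, 0.0), (4.0, 0.0), (4.0, 1.0 + 0.2 * t), (0.0, 1.0 + 0.2 * t)]
          for t in range(6)]
features = [normalize_landmarks(frame) for frame in frames]
lstm = init_lstm(input_size=8, hidden_size=5, rng=rng)
clip_encoding = lstm_encode(lstm, features, hidden_size=5)
```

In a full system the final hidden state would pass through a trained output layer and `softmax` to score each word or phrase class; here the weights are random, so the encoding only demonstrates the data flow.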

Key words: lip-reading recognition, long short-term memory, computer vision

CLC Number: TP391

MA Ning, TIAN Guodong, ZHOU Xi. A lip-reading recognition approach based on long short-term memory[J]. , 2018, 35(1): 109-117.
