[1] McGurk H, MacDonald J. Hearing lips and seeing voices[J]. Nature, 1976, 264:746-748.
[2] Graves A. Supervised sequence labelling with recurrent neural networks[M]. Berlin:Springer Berlin Heidelberg, 2012.
[3] Zhou Z, Zhao G, Hong X, et al. A review of recent advances in visual speech decoding[J]. Image & Vision Computing, 2014, 32:590-605.
[4] Zhao G, Barnard M, Pietikainen M. Lipreading with local spatiotemporal descriptors[J]. IEEE Transactions on Multimedia, 2009, 11:1254-1265.
[5] Zhou Z, Zhao G, Pietikainen M. Towards a practical lipreading system[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011:137-144.
[6] Bakry A, Elgammal A. MKPLS: manifold kernel partial least squares for lipreading and speaker identification[C]//Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013:684-691.
[7] Lan Y, Harvey R, Theobald B J, et al. Comparing visual features for lipreading[C]//International Conference on Auditory-Visual Speech Processing. 2009:102-106.
[8] Pei Y, Kim T K, Zha H. Unsupervised random forest manifold alignment for lipreading[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013:129-136.
[9] Rekik A, Ben-Hamadou A, Mahdi W. A new visual speech recognition approach for RGB-D cameras[C]//International Conference Image Analysis and Recognition. Springer International Publishing, 2014:21-28.
[10] Noda K, Yamaguchi Y, Nakadai K, et al. Lipreading using convolutional neural network[C]//Interspeech, 2014:1149-1153.
[11] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks[C]//Proceedings of the 31st International Conference on Machine Learning, 2014:1764-1772.
[12] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems, 2014:3104-3112.
[13] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015:2625-2634.
[14] Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory[C]//Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016:6115-6119.
[15] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in Neural Information Processing Systems, 2014:568-576.
[16] Cooke M, Barker J, Cunningham S, et al. An audio-visual corpus for speech perception and automatic speech recognition[J]. The Journal of the Acoustical Society of America, 2006, 120:2421-2424.
[17] Xiong X, De la Torre F. Supervised descent method and its applications to face alignment[C]//Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013:532-539.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural computation, 1997, 9:1735-1780.
[19] Gal Y, Ghahramani Z. A theoretically grounded application of dropout in recurrent neural networks[C]//Advances in Neural Information Processing Systems, 2016:1019-1027.