
Journal of University of Chinese Academy of Sciences ›› 2018, Vol. 35 ›› Issue (1): 109-117. DOI: 10.7523/j.issn.2095-6134.2018.01.015

• Information and Electronic Science •

  • Corresponding author: MA Ning
  • Funding: Supported by the National Natural Science Foundation of China (61472386, 61502444) and a sub-project of the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06040103)

A lip-reading recognition approach based on long short-term memory

MA Ning1,2, TIAN Guodong2, ZHOU Xi2   

  1. University of Chinese Academy of Sciences, Beijing 100049, China;
    2. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
  • Received: 2016-11-23  Revised: 2017-03-15  Published: 2018-01-15


Abstract: Visual speech information, i.e., lip motion, is an important carrier of the content of speech. However, visual speech information differs greatly across speakers, due to variations in lip appearance, background, and speaking habits, even when the spoken content is the same. To address this diversity of visual speech information, we propose a new lip-reading recognition approach based on long short-term memory (LSTM). Whereas most previous methods start from lip appearance, we use the coordinates of lip landmarks, which describe lip deformation, as the features of a lip-reading video; these features exhibit within-class consistency and between-class distinctiveness. We then use an LSTM to encode the features over time, learning discriminative and generalizable spatio-temporal representations. Our approach is evaluated on three public databases (GRID, MIRACL-VC, and OuluVS) for speaker-independent recognition of isolated words or phrases. On GRID and MIRACL-VC, the accuracy of our approach is at least 30% higher than that of conventional approaches. On OuluVS, its accuracy is comparable to the state of the art. These results indicate that the proposed LSTM-based lip-reading approach effectively addresses the diversity of visual speech information.
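The pipeline described in the abstract — per-frame lip-landmark coordinates fed through an LSTM whose final hidden state summarizes the sequence for classification — can be sketched as follows. This is a minimal illustration, not the paper's model: the landmark count (20), hidden size (64), class count (10), and random (untrained) weights are all assumptions for demonstration, and no training procedure is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleLSTM:
    """Single-layer LSTM forward pass (untrained; for illustration only)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        # Stacked weights for the input, forget, candidate, and output gates.
        self.W = rng.uniform(-scale, scale, (4 * hidden_dim, input_dim))
        self.U = rng.uniform(-scale, scale, (4 * hidden_dim, hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, xs):
        """xs: (T, input_dim) — one landmark-coordinate vector per frame."""
        H = self.hidden_dim
        h = np.zeros(H)
        c = np.zeros(H)
        for x in xs:
            z = self.W @ x + self.U @ h + self.b
            i = sigmoid(z[0:H])        # input gate
            f = sigmoid(z[H:2 * H])    # forget gate
            g = np.tanh(z[2 * H:3 * H])  # candidate cell state
            o = sigmoid(z[3 * H:4 * H])  # output gate
            c = f * c + i * g
            h = o * np.tanh(c)
        return h  # final hidden state encodes the whole lip sequence

# Toy usage: 25 frames, 20 lip landmarks -> 40-dim (x, y) vector per frame.
T, n_landmarks, n_classes = 25, 20, 10
frames = np.random.default_rng(1).normal(size=(T, 2 * n_landmarks))
lstm = SimpleLSTM(input_dim=2 * n_landmarks, hidden_dim=64)
encoding = lstm.forward(frames)
# A linear classifier over word/phrase classes on top of the encoding.
logits = np.random.default_rng(2).normal(size=(n_classes, 64)) @ encoding
predicted_class = int(np.argmax(logits))
```

Using landmark coordinates rather than raw pixels is what gives the features their within-class consistency: the same spoken word produces similar lip deformations regardless of lip appearance or background.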

Key words: lip-reading recognition, long short-term memory, computer vision
