Journal of University of Chinese Academy of Sciences ›› 2005, Vol. 22 ›› Issue (2): 140-146. DOI: 10.7523/j.issn.2095-6134.2005.2.003

• Article

An Improved Incremental Approach to Speech Corpus Selection

NING Zhen-Jiang, DU Li-Min

  1. Labs for Speech Interaction Technology Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China
  • Received: 2004-04-12 Revised: 2004-06-08 Published: 2005-03-15
  • Corresponding author: DU Li-Min, E-mail: (ningzhj,dulm)@iis.ac.cn
  • Funding:

    Supported by the National Key Basic Research Development Program of China (973 Program) (G1998030505)

Abstract:

This paper proposes an improved incremental corpus selection approach. By running a second screening pass that erases redundant sentences at each incremental selection phase, it controls the balance of phone units in the selected corpus more effectively. In our experiments, the original data source is ten years of the People's Daily and other newspapers and magazines, about 20 million sentences containing 8472 phone contexts. The corpus selected from this source consists of only 17865 sentences, yet it reaches a coverage of 94.3% for phone contexts that appear more than 10 times in the selected corpus, and the distribution variance of phone contexts in the selected corpus is 0.18 × 10⁻³. Experimental results show that the approach is much better than traditional algorithms in both phonetic unit coverage and phonetic unit distribution variance. Moreover, the algorithm has low computational complexity and memory cost, which makes it suitable for selecting corpora for speech recognition acoustic models from massive real-world text such as newspapers, magazines, and the Internet.
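The abstract does not spell out the selection rule, the screening rule, or the exact definitions of coverage and distribution variance, so the following is only a minimal sketch of the general idea under stated assumptions. The names `units_of`, `select_corpus`, `coverage`, and `distribution_variance` are hypothetical, adjacent token pairs stand in for phone contexts, candidates are processed in fixed-size batches, and a sentence counts as redundant when every unit it contains would still meet a minimum count without it.

```python
from collections import Counter
from statistics import pvariance


def units_of(sentence):
    """Hypothetical unit extractor: adjacent token pairs stand in for phone contexts."""
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))


def select_corpus(candidates, batch_size=2, min_count=1):
    """Select sentences batch by batch, pruning redundant ones after each batch."""
    selected = []
    counts = Counter()
    for start in range(0, len(candidates), batch_size):
        # Selection phase: keep a sentence only if it adds an under-covered unit.
        for sentence in candidates[start:start + batch_size]:
            units = units_of(sentence)
            if any(counts[u] < min_count for u in units):
                selected.append(sentence)
                counts.update(units)
        # Second screening (assumed rule): drop a sentence if every unit it
        # contains would still reach min_count without it.
        for sentence in list(selected):
            units = Counter(units_of(sentence))
            if all(counts[u] - n >= min_count for u, n in units.items()):
                selected.remove(sentence)
                counts.subtract(units)
    return selected, counts


def coverage(counts, all_units, min_count=1):
    """Fraction of known units that occur at least min_count times (assumed definition)."""
    return sum(1 for u in all_units if counts[u] >= min_count) / len(all_units)


def distribution_variance(counts):
    """Population variance of relative unit frequencies (assumed definition)."""
    total = sum(counts.values())
    return pvariance([c / total for c in counts.values() if c > 0])


# Toy data: "a b" is selected first, then pruned once "a b c" covers its only unit.
selected, counts = select_corpus(["a b", "a b c", "c d", "b c"], batch_size=2)
print(selected)  # ['a b c', 'c d']
```

In this toy run the screening pass removes one sentence without losing any covered unit, which is the kind of balance control the abstract attributes to the second screening step. A real system would replace `units_of` with a pronunciation-lexicon lookup that yields actual phone contexts.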

Key words: speech recognition, acoustic model, corpus selection

CLC number: