An Improved Incremental Approach to Speech Corpus Selection

Journal of University of Chinese Academy of Sciences, 2005, Vol. 22, Issue (2): 140-146. DOI: 10.7523/j.issn.2095-6134.2005.2.003

NING Zhen-Jiang, DU Li-Min

Labs for Speech Interaction Technology Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China

Received: 2004-04-12  Revised: 2004-06-08  Published: 2005-03-15
Corresponding author: DU Li-Min, E-mail: (ningzhj,dulm)@iis.ac.cn
Funding: Supported by the National 973 Key Basic Research Development Program of China (G1998030505)

Abstract:

This paper proposes an improved incremental corpus selection approach. At each incremental selection phase, the sentences obtained so far are screened a second time to erase redundant ones, which gives tighter control over the balance of phone units in the selected corpus. In the experiments, the candidate pool is drawn from ten years of the People's Daily and other newspapers and magazines: about 20 million sentences containing 8472 phone contexts. The corpus selected from this pool has only 17865 sentences, yet it reaches a coverage of 94.3% for phone contexts that occur more than 10 times in the selected corpus, with a distribution variance of 0.18 × 10^-3 over the phone contexts. The experimental results show that the approach is much better than traditional algorithms in both phonetic unit coverage and the variance of the phonetic unit distribution. It also has low computational and storage costs, which makes it suitable for selecting acoustic-model training corpora for speech recognition from massive real-world text sources such as newspapers, magazines, and the Internet.

Key words: speech recognition, acoustic model, corpus selection
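The abstract does not spell out the procedure, so the following is only a minimal Python sketch of the general idea it describes (incremental selection with a second screening pass after each phase), not the authors' algorithm. The function names, the `target` count, the representation of a sentence as a tuple of unit labels, the `min_count` coverage threshold, and the variance definition (population variance of relative unit frequencies) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import pvariance


def select_corpus(batches, target=2):
    """Incrementally select sentences, pruning redundant ones after each batch.

    Each sentence is a tuple of unit labels (assumed already transcribed).
    `target` is a hypothetical minimum count wanted for every unit.
    """
    selected = []        # sentences kept so far
    counts = Counter()   # unit -> occurrences in `selected`
    for batch in batches:
        # Selection pass: keep a sentence if it supplies an under-represented unit.
        for sent in batch:
            if any(counts[u] < target for u in sent):
                selected.append(sent)
                counts.update(sent)
        # Second screening pass: drop a sentence if every unit in it would
        # still meet the target without that sentence.
        kept = []
        for sent in selected:
            need = Counter(sent)
            if all(counts[u] - n >= target for u, n in need.items()):
                counts.subtract(need)
            else:
                kept.append(sent)
        selected = kept
    return selected, counts


def coverage(counts, all_units, min_count=1):
    """Fraction of `all_units` that occur at least `min_count` times."""
    return sum(1 for u in all_units if counts[u] >= min_count) / len(all_units)


def distribution_variance(counts, all_units):
    """Population variance of the units' relative frequencies (assumed metric)."""
    total = sum(counts[u] for u in all_units)
    return pvariance([counts[u] / total for u in all_units])


# Toy candidate pool, split into two selection phases.
batches = [
    [("a", "b"), ("a", "b"), ("a", "c")],
    [("c", "d"), ("a", "b", "c", "d"), ("d",)],
]
selected, counts = select_corpus(batches, target=2)
print(selected)
print(coverage(counts, "abcd", min_count=2))
print(distribution_variance(counts, "abcd"))
```

In this toy run the screening pass after the second phase discards two sentences that were selected earlier but became redundant once later sentences covered the same units. Processing the pool in batches means only the selected set and its unit counts need to stay in memory, which is one plausible reading of the low storage cost mentioned in the abstract.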

CLC number: TP391

NING Zhen-Jiang, DU Li-Min. An Improved Incremental Approach to Speech Corpus Selection[J]. Journal of University of Chinese Academy of Sciences, 2005, 22(2): 140-146.


