Journal of University of Chinese Academy of Sciences

Research on voiceprint recognition based on fusion features MGCC and CNN-SE-BiGRU

FAN Tao, ZHAN Xu

  1. School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, Sichuan, China
  • Received: 2023-11-23 Revised: 2024-01-26 Published: 2024-04-03
  • Corresponding author: E-mail: zhanxuu@163.com

Abstract: In voiceprint recognition, a single acoustic feature has limited representational power and noise robustness, and a conventional CNN model has weak feature expression and captures temporal features incompletely. To address these problems, this paper proposes an acoustic feature that fuses mel frequency cepstral coefficients (MFCC) with gammatone frequency cepstral coefficients (GFCC), together with a new deep network structure, CNN-SE-BiGRU, which integrates a convolutional neural network equipped with a squeeze-and-excitation mechanism and a bidirectional gated recurrent unit network. First, the extracted MFCC and GFCC features are normalized separately; weights are then designed according to the between-class discrimination of each feature, and the two features are combined by linear weighting to obtain mel-gammatone cepstral coefficients (MGCC), which discriminate speakers more strongly. Second, to improve the CNN's representation of voiceprint features, an improved channel-wise feature response model, the SE-Block (squeeze and excitation block), is introduced. Finally, after the improved squeeze-and-excitation convolutional network (CNN-SE-Net) has extracted spatial features, a bidirectional gated recurrent unit network (BiGRU) further extracts temporal features, improving the performance of the whole network. Experimental results show that the MGCC feature exhibits stronger representational power and better robustness under different noise backgrounds, and that the CNN-SE-BiGRU model achieves the highest average recognition rate, 96.05%, when using the MGCC feature, which demonstrates the effectiveness and robustness of the proposed method.
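The fusion step described in the abstract (normalize MFCC and GFCC separately, then combine them by linear weighting) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the normalization scheme or the weight values, so the sketch assumes per-coefficient z-score normalization, equal dimensionality for the two features, and weights that sum to 1. The function names `zscore_columns` and `fuse_mgcc` and the parameter `w_mfcc` are hypothetical. Plain Python lists are used so the sketch runs without third-party packages.

```python
import math


def zscore_columns(frames):
    """Normalize each coefficient (column) to zero mean and unit variance.

    `frames` is a list of frames; each frame is a list of cepstral coefficients.
    Assumption: z-score normalization (the abstract only says "normalized").
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]


def fuse_mgcc(mfcc, gfcc, w_mfcc=0.5):
    """Linearly weight normalized MFCC and GFCC frames into a fused feature.

    Assumptions: both inputs have the same shape, and the GFCC weight is
    1 - w_mfcc. The default weight 0.5 is a placeholder, not a value from the paper.
    """
    if len(mfcc) != len(gfcc) or len(mfcc[0]) != len(gfcc[0]):
        raise ValueError("MFCC and GFCC must have the same shape")
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    w_gfcc = 1.0 - w_mfcc
    m = zscore_columns(mfcc)
    g = zscore_columns(gfcc)
    return [
        [w_mfcc * a + w_gfcc * b for a, b in zip(row_m, row_g)]
        for row_m, row_g in zip(m, g)
    ]


# Toy input: 2 frames x 2 coefficients per feature (made-up numbers).
toy_mfcc = [[1.0, 2.0], [3.0, 6.0]]
toy_gfcc = [[0.0, 10.0], [4.0, 10.0]]
toy_mgcc = fuse_mgcc(toy_mfcc, toy_gfcc, w_mfcc=0.6)
print(toy_mgcc)
```

In the paper the weights are chosen from the between-class discrimination of each feature; here the weight is simply passed in by the caller, since the abstract does not state how it is computed.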

Key words: voiceprint recognition, fusion features, bidirectional gated recurrent unit, squeeze and excitation block (SE-Block)
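For readers unfamiliar with the squeeze and excitation block listed above, the following minimal sketch shows the standard SE operation: squeeze each channel to its global average, pass the resulting vector through two small fully connected layers (ReLU, then sigmoid), and rescale each channel by the resulting weight. It is the widely used baseline SE block, not the improved variant proposed in the paper, whose details the abstract does not give. The function name `se_block`, the weight matrices `w1` and `w2`, and the omission of bias terms are assumptions made for illustration.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a basic squeeze-and-excitation step to a feature map.

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weight matrix of the reduction layer (r = reduction ratio).
    w2: C x (C // r) weight matrix of the expansion layer.
    Biases are omitted for brevity (an assumption of this sketch).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, part 1: reduction layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, part 2: expansion layer followed by sigmoid, one weight per channel.
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [
        [[s * x for x in row] for row in channel]
        for channel, s in zip(feature_map, scales)
    ]
    return scaled, scales


# Toy input: 2 channels of size 2 x 2, with made-up weights (reduction to 1 unit).
toy_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
toy_w1 = [[0.5, 0.1]]
toy_w2 = [[0.0], [2.0]]
toy_scaled, toy_scales = se_block(toy_map, toy_w1, toy_w2)
print(toy_scales)
```

In a trained network `w1` and `w2` are learned, so channels that help separate speakers receive weights closer to 1 and less useful channels are suppressed.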

CLC number: