
Journal of University of Chinese Academy of Sciences, 2025, Vol. 42, Issue (6): 832-842. DOI: 10.7523/j.ucas.2024.004

• Electronic Information and Computer Science •

Voiceprint recognition based on fused MGCC and CNN-SE-BiGRU features

FAN Tao, ZHAN Xu

  1. School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, Sichuan, China
  • Received: 2023-11-23 Revised: 2024-01-26 Published: 2024-04-03
  • Corresponding author: ZHAN Xu, E-mail: zhanxuu@163.com

Abstract: Voiceprint recognition that relies on a single acoustic feature suffers from weak representation ability and poor noise robustness, while traditional convolutional neural network (CNN) models have limited feature expression ability and capture temporal features incompletely. To address these problems, an acoustic feature that fuses Mel frequency cepstral coefficients (MFCC) and Gammatone frequency cepstral coefficients (GFCC) is proposed, together with a new recognition network model (CNN-SE-BiGRU). First, the extracted MFCC and GFCC features are normalized separately, and weights designed according to the inter-class discrimination of the features are used to combine them linearly, yielding Mel-gammatone cepstral coefficients (MGCC) with stronger speaker discrimination. Second, to improve the CNN's expression of voiceprint features, an improved channel feature response SE-Block (squeeze-and-excitation block) is introduced. Finally, on top of the spatial features extracted by the improved squeeze-and-excitation convolutional network (CNN-SE-Net), temporal features are further extracted by a bidirectional gated recurrent unit network (BiGRU) to improve the performance of the whole network. Experimental results show that MGCC exhibits stronger representation ability and better robustness under different noise backgrounds, and that the CNN-SE-BiGRU model reaches its highest average recognition rate, 96.05%, with MGCC features, demonstrating the effectiveness and robustness of the proposed method.

Key words: voiceprint recognition, fusion features, bidirectional gated recurrent unit, squeeze and excitation block, convolutional neural network (CNN)

CLC number:
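
The fusion step summarized in the abstract (normalize MFCC and GFCC separately, then combine them linearly with weights based on inter-class discrimination) can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything specific here is an assumption: the per-dimension Fisher ratio stands in for "inter-class discrimination", the function names `zscore`, `fisher_ratio`, and `fuse_mgcc` are hypothetical, the two feature matrices are assumed to share the same shape (frames × coefficients), and the data is synthetic rather than real speech.

```python
import numpy as np

def zscore(features):
    """Normalize each coefficient dimension to zero mean and unit variance."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + 1e-8)

def fisher_ratio(features, labels):
    """Per-dimension ratio of between-speaker to within-speaker scatter.

    Assumption: used here as a stand-in for the paper's discrimination measure.
    """
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for speaker in np.unique(labels):
        group = features[labels == speaker]
        group_mean = group.mean(axis=0)
        between += len(group) * (group_mean - overall_mean) ** 2
        within += ((group - group_mean) ** 2).sum(axis=0)
    return between / (within + 1e-8)

def fuse_mgcc(mfcc, gfcc, labels):
    """Linearly weight normalized MFCC and GFCC, dimension by dimension."""
    mfcc_n, gfcc_n = zscore(mfcc), zscore(gfcc)
    score_m = fisher_ratio(mfcc_n, labels)
    score_g = fisher_ratio(gfcc_n, labels)
    # Weights lie in [0, 1] and sum to 1 for every coefficient dimension.
    w_m = score_m / (score_m + score_g + 1e-8)
    w_g = 1.0 - w_m
    return w_m * mfcc_n + w_g * gfcc_n, w_m

# Synthetic stand-in data: 3 speakers, 50 frames each, 13 coefficients.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
mfcc = rng.normal(size=(150, 13)) + 0.5 * labels[:, None]
gfcc = rng.normal(size=(150, 13)) + 1.5 * labels[:, None]

mgcc, w_m = fuse_mgcc(mfcc, gfcc, labels)
print(mgcc.shape)  # (150, 13)
```

In this toy setup the synthetic GFCC separates the speakers more strongly, so the sketch assigns it the larger weight. With real recordings the balance would depend on the noise conditions and on whatever discrimination measure the authors actually use.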