Voiceprint recognition based on fused MGCC and CNN-SE-BiGRU features

doi:10.7523/j.ucas.2024.004

Abstract

Abstract: Voiceprint recognition with a single acoustic feature suffers from limited representation ability and noise robustness, and conventional convolutional neural network (CNN) models have weak feature expression ability and capture temporal features incompletely. To address these problems, this paper proposes an acoustic feature that fuses Mel frequency cepstral coefficients (MFCC) and Gammatone frequency cepstral coefficients (GFCC), together with a new recognition network, CNN-SE-BiGRU, which combines an enhanced CNN with a bidirectional gated recurrent unit (BiGRU) network. First, the extracted MFCC and GFCC features are normalized separately, and weights designed according to the inter-class discrimination of the features are used to combine them linearly, yielding Mel-gammatone cepstral coefficients (MGCC) with stronger speaker discrimination. Second, to improve the CNN's representation of voiceprint features, an improved channel-wise feature response module, the squeeze-and-excitation block (SE-Block), is introduced. Finally, on top of the spatial features extracted by the improved squeeze-and-excitation CNN (CNN-SE-Net), a BiGRU extracts temporal features to improve the performance of the whole network. Experimental results show that, under different noise backgrounds, the MGCC feature has stronger representation ability and better robustness, and that with MGCC features the CNN-SE-BiGRU model achieves the highest average recognition rate, 96.05%, which demonstrates the effectiveness and robustness of the proposed method.

Key words: voiceprint recognition, fused features, bidirectional gated recurrent unit (BiGRU), squeeze-and-excitation block (SE-Block), convolutional neural network (CNN)
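
The feature-fusion step described in the abstract (separate normalization of MFCC and GFCC, then a linear combination weighted by inter-class discrimination) can be illustrated with a minimal sketch. The abstract does not state how the discrimination measure or the weights are computed, so the Fisher ratio (between-class over within-class variance) and the single scalar weight `w` used here are assumptions, as are the function names, the toy data, and the requirement that both feature matrices have the same shape. This is not the authors' implementation.

```python
import numpy as np


def zscore(features: np.ndarray) -> np.ndarray:
    """Normalize each coefficient (column) to zero mean and unit variance."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + 1e-8)


def fisher_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """Between-class / within-class variance, averaged over coefficients.

    Assumption: a stand-in for the paper's inter-class discrimination
    measure, which the abstract does not specify.
    """
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for speaker in np.unique(labels):
        group = features[labels == speaker]
        group_mean = group.mean(axis=0)
        between += len(group) * (group_mean - overall_mean) ** 2
        within += ((group - group_mean) ** 2).sum(axis=0)
    return float(np.mean(between / (within + 1e-8)))


def fuse_mgcc(mfcc: np.ndarray, gfcc: np.ndarray, labels: np.ndarray):
    """Linearly combine normalized MFCC and GFCC frames into one feature.

    Both inputs are (frames, coefficients) arrays of identical shape
    (an assumption of this sketch). Returns the fused matrix and the
    weight given to the MFCC branch.
    """
    mfcc_n, gfcc_n = zscore(mfcc), zscore(gfcc)
    f_mfcc = fisher_ratio(mfcc_n, labels)
    f_gfcc = fisher_ratio(gfcc_n, labels)
    w = f_mfcc / (f_mfcc + f_gfcc)  # more discriminative feature gets more weight
    return w * mfcc_n + (1.0 - w) * gfcc_n, w


# Toy data: 3 hypothetical speakers, 50 frames each, 13 coefficients per frame.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
centers = rng.normal(size=(3, 13))
mfcc = centers[labels] + 0.5 * rng.normal(size=(150, 13))  # cleaner feature
gfcc = centers[labels] + 2.0 * rng.normal(size=(150, 13))  # noisier feature
mgcc, w = fuse_mgcc(mfcc, gfcc, labels)
```

In this toy setup the less noisy feature separates the speakers better and therefore receives the larger weight. With real audio, the two matrices would come from an MFCC and a GFCC extractor run on the same frames.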

CLC number: TN912.34

FAN Tao, ZHAN Xu. Voiceprint recognition based on fused MGCC and CNN-SE-BiGRU features[J]. Journal of University of Chinese Academy of Sciences, 2025, 42(6): 832-842.
