Voiceprint recognition based on fused MGCC and CNN-SE-BiGRU features

doi:10.7523/j.ucas.2024.004

Abstract

Voiceprint recognition often relies on a single feature with weak representation ability and poor noise robustness, while traditional convolutional neural network (CNN) models have limited feature expression ability and capture temporal features incompletely. To address these problems, this paper proposes an acoustic feature that fuses Mel frequency cepstral coefficients (MFCC) with Gammatone frequency cepstral coefficients (GFCC), together with a novel voiceprint recognition model based on an enhanced CNN and a bidirectional GRU network (CNN-SE-BiGRU). First, the extracted MFCC and GFCC features are normalized, and weights designed according to their inter-class discrimination power are used to combine them linearly, yielding Mel-gammatone cepstral coefficients (MGCC) with stronger speaker discrimination. Second, to improve the CNN's feature expression, an improved channel feature response module, the squeeze-and-excitation block (SE-Block), is introduced. Finally, building on the spatial features extracted by the squeeze-and-excitation-enhanced CNN (CNN-SE), a bidirectional gated recurrent unit network (BiGRU) further extracts time-series features to improve the performance of the whole network. Experimental results show that the MGCC features have stronger representation ability and better robustness under different noise backgrounds, and that the CNN-SE-BiGRU model reaches an average recognition rate of 96.05% with MGCC features, demonstrating the effectiveness and robustness of the proposed method.
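The fusion step described in the abstract (normalize both feature sets, then combine them linearly with weights tied to inter-class discrimination power) can be illustrated with a minimal NumPy sketch. The abstract does not give the weighting formula, so the per-coefficient Fisher ratio used here is an assumption chosen for illustration, as are the function names `zscore`, `fisher_ratio`, and `fuse_mgcc` and the requirement that the MFCC and GFCC matrices have the same shape.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero


def zscore(feats):
    """Normalize each coefficient (column) to zero mean and unit variance."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + EPS)


def fisher_ratio(feats, labels):
    """Per-coefficient between-class scatter divided by within-class scatter.

    Assumption: this stands in for the "inter-class discrimination power"
    mentioned in the abstract; the paper's exact measure may differ.
    """
    overall_mean = feats.mean(axis=0)
    between = np.zeros(feats.shape[1])
    within = np.zeros(feats.shape[1])
    for speaker in np.unique(labels):
        x = feats[labels == speaker]
        class_mean = x.mean(axis=0)
        between += len(x) * (class_mean - overall_mean) ** 2
        within += ((x - class_mean) ** 2).sum(axis=0)
    return between / (within + EPS)


def fuse_mgcc(mfcc, gfcc, labels):
    """Linearly combine normalized MFCC and GFCC frames of shape (frames, coeffs).

    Returns the fused features and the per-coefficient MFCC weights;
    the GFCC weights are 1 minus the MFCC weights.
    """
    mfcc_n, gfcc_n = zscore(mfcc), zscore(gfcc)
    f_m = fisher_ratio(mfcc_n, labels)
    f_g = fisher_ratio(gfcc_n, labels)
    w_m = f_m / (f_m + f_g + EPS)  # more discriminative feature gets more weight
    w_g = 1.0 - w_m
    return w_m * mfcc_n + w_g * gfcc_n, w_m
```

In this sketch, `labels` holds one speaker index per frame, and the weights are estimated on labeled training data. A coefficient whose MFCC version separates speakers better than its GFCC version receives a weight above 0.5, and vice versa.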

Key words: voiceprint recognition, fusion features, bidirectional gated recurrent unit, squeeze and excitation block, convolutional neural network (CNN)

CLC Number: TN912.34

FAN Tao, ZHAN Xu. Voiceprint recognition based on fused MGCC and CNN-SE-BiGRU features[J]. Journal of University of Chinese Academy of Sciences, 2025, 42(6): 832-842.
