Research on voiceprint recognition based on fusion features MGCC and CNN-SE-BiGRU

doi:10.7523/j.ucas.2024.004

Abstract

Voiceprint recognition faces two problems: single acoustic features have limited representation ability and noise robustness, and conventional CNN models express features weakly and capture temporal information incompletely. To address them, this paper proposes an acoustic feature that fuses mel frequency cepstral coefficients (MFCC) with gammatone frequency cepstral coefficients (GFCC), together with a new deep network, CNN-SE-BiGRU, which integrates a convolutional neural network equipped with a squeeze-and-excitation mechanism and a bidirectional gated recurrent unit network. First, the extracted MFCC and GFCC features are normalized separately; weights are then designed according to the between-class discrimination of each feature, and the two features are linearly weighted to obtain mel-gammatone cepstral coefficients (MGCC) with stronger speaker discrimination. Second, to improve the CNN's representation of voiceprint features, an improved SE-Block (squeeze and excitation block), which recalibrates channel-wise feature responses, is introduced. Finally, building on the spatial features extracted by the improved squeeze-and-excitation convolutional network (CNN-SE-Net), a bidirectional gated recurrent unit network (BiGRU) extracts temporal features to improve the performance of the whole network. Experimental results show that MGCC features exhibit stronger representation ability and better robustness under different noise backgrounds, and that with MGCC features the CNN-SE-BiGRU model achieves the highest average recognition rate, 96.05%, which demonstrates the effectiveness and robustness of the proposed method.
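
The fusion step in the abstract (normalize MFCC and GFCC separately, then combine them with a linear weight) can be illustrated with a minimal sketch. Several details here are assumptions, since the abstract does not specify them: per-coefficient z-score normalization, a single scalar weight `w_mfcc` with the GFCC weight fixed at `1 - w_mfcc`, and the helper names `zscore_columns` and `fuse_mgcc`. The paper derives its weights from between-class discrimination; that procedure is not reproduced here, so the weight is simply passed in.

```python
from statistics import mean, pstdev


def zscore_columns(frames):
    """Normalize each coefficient (column) to zero mean and unit variance."""
    columns = list(zip(*frames))
    mus = [mean(col) for col in columns]
    # Constant columns have zero deviation; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in frames
    ]


def fuse_mgcc(mfcc, gfcc, w_mfcc):
    """Linearly weight normalized MFCC and GFCC frames of identical shape.

    w_mfcc is a hypothetical scalar weight in [0, 1]; GFCC receives 1 - w_mfcc.
    """
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    if len(mfcc) != len(gfcc) or any(len(a) != len(b) for a, b in zip(mfcc, gfcc)):
        raise ValueError("MFCC and GFCC must have the same shape")
    m, g = zscore_columns(mfcc), zscore_columns(gfcc)
    return [
        [w_mfcc * a + (1.0 - w_mfcc) * b for a, b in zip(row_m, row_g)]
        for row_m, row_g in zip(m, g)
    ]


# Toy input: 3 frames x 2 coefficients (made-up numbers, not real features).
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
gfcc = [[3.0, 5.0], [2.0, 5.0], [1.0, 5.0]]
mgcc = fuse_mgcc(mfcc, gfcc, w_mfcc=0.6)
```

In practice the MFCC and GFCC matrices would come from a feature extractor applied to the same frames, so that row `i` of both inputs describes the same time window; the sketch only checks that the shapes agree.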

Key words: voiceprint recognition, fusion features, bidirectional gated recurrent unit, squeeze and excitation block
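
To make the squeeze and excitation block concrete, the sketch below follows the standard formulation: global average pooling per channel (squeeze), a two-layer bottleneck ending in a sigmoid (excitation), and channel-wise rescaling. It is not the improved SE-Block of the paper, whose modifications the abstract does not describe. The function names, the bias-free layers, and the hand-picked weights are all assumptions chosen for illustration.

```python
import math


def squeeze(feature_map):
    """Global average pooling: reduce each channel (H x W) to one scalar."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def excite(z, w1, w2):
    """Bottleneck gate: sigmoid(W2 @ relu(W1 @ z)), one value per channel."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]


def se_block(feature_map, w1, w2):
    """Rescale every channel by its gate value in (0, 1)."""
    gates = excite(squeeze(feature_map), w1, w2)
    return [
        [[gate * x for x in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]


# Toy input: 2 channels of 2x2 activations; weights are hand-picked, not trained.
x = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[2.0, 4.0], [6.0, 8.0]],  # channel 1, mean 5.0
]
w1 = [[1.0, 0.0]]    # reduction: 2 channels -> 1 hidden unit
w2 = [[0.0], [1.0]]  # expansion: 1 hidden unit -> 2 channel gates
y = se_block(x, w1, w2)
```

With these weights channel 0 receives a gate of sigmoid(0) = 0.5 and channel 1 a gate of sigmoid(1), so the block attenuates the two channels by different amounts. In a trained network `w1` and `w2` are learned, and the output would feed the next convolutional layer or, in a CNN-SE-BiGRU style model, the recurrent stage.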

CLC number: TN912.34

FAN Tao, ZHAN Xu. Research on voiceprint recognition based on fusion features MGCC and CNN-SE-BiGRU[J]. Journal of University of Chinese Academy of Sciences, DOI: 10.7523/j.ucas.2024.004.
