Investigation of normalization methods in speaker adaptation of deep neural network using i-vector

doi:10.7523/j.issn.2095-6134.2017.05.014

Abstract

The deep neural network (DNN) has become a remarkable modeling technology for speech recognition in recent years, and its performance is significantly better than that of the Gaussian mixture model, which was previously the mainstream modeling technology in speech recognition. However, effective speaker adaptation of DNNs remains an unsolved problem. In this work, we use the identity vector (i-vector) to adapt a deep neural network by feeding the i-vector together with the regular speech features as the input of the DNN for both training and testing. We then focus on the normalization of the i-vector, using a new max-min linear normalization method. This yields a 5.10% relative decrease in word error rate over the traditional length normalization method.
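The abstract does not give the exact formulas, so the following is only a minimal sketch of the two normalization schemes and of the input concatenation, under stated assumptions: max-min linear normalization is taken to be a per-dimension rescaling to [0, 1] with statistics estimated on the training speakers, and length normalization is taken to be scaling each i-vector to unit Euclidean norm. The function names (`length_normalize`, `max_min_normalize`, `append_ivector`) and the array shapes are hypothetical and chosen for illustration.

```python
import numpy as np

def length_normalize(ivectors):
    """Scale each i-vector (one per row) to unit Euclidean length."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / np.maximum(norms, 1e-12)  # guard against zero vectors

def max_min_normalize(ivectors, lo=None, hi=None):
    """Assumed form of max-min linear normalization: map each dimension
    linearly to [0, 1]. lo/hi default to the per-dimension min/max of the
    given data; at test time, pass the values estimated on training data."""
    lo = ivectors.min(axis=0) if lo is None else lo
    hi = ivectors.max(axis=0) if hi is None else hi
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (ivectors - lo) / span, lo, hi

def append_ivector(frames, ivector):
    """Append one speaker's i-vector to every frame of acoustic features,
    giving the augmented DNN input (frames: T x D, ivector: K)."""
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])
```

In such a setup, the same `lo` and `hi` returned for the training speakers would be reused when normalizing test-speaker i-vectors, so that training and test inputs share one scale.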

Key words: identity vector, deep neural network, speaker adaptation, normalization

CLC Number: TN912

YANG Jianbin, ZHANG Weiqiang, LIU Jia. Investigation of normalization methods in speaker adaptation of deep neural network using i-vector[J]. , 2017, 34(5): 633-639.
