Scene-independent sound event detection based on multi-state transition model

doi:10.7523/j.issn.2095-6134.2019.02.009

Abstract

We designed a multi-state transition model for different types of sound events and combined it with two kinds of deep neural networks to improve on the traditional framework. Evaluated on the DCASE2017 Task 2 development dataset, the improved DNN-HMM system outperformed the baseline, lowering the error rate (ER) by 19% absolute and raising the F-score by 8.9% relative. The state clustering convolutional neural network (SC-CNN) system, also based on the multi-state transition model, lowered the ER by 30% absolute and raised the F-score by 18% relative, a result comparable to the international state of the art.
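The abstract reports ER changes in absolute terms and F-score changes in relative terms, and the two are easy to confuse. The following minimal sketch shows the difference, using the commonly cited definitions ER = (S + D + I) / N and F = 2TP / (2TP + FP + FN). The function names and all counts are hypothetical illustrations, not the paper's evaluation code or results, and the sketch assumes a single event class, so there are no substitutions.

```python
def error_rate(substitutions, deletions, insertions, n_reference):
    """Error rate: total errors divided by the number of reference events."""
    return (substitutions + deletions + insertions) / n_reference


def f_score(tp, fp, fn):
    """F-score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def absolute_change(baseline, system):
    """Difference in the metric's own units (e.g. ER 0.65 -> 0.40 is -0.25)."""
    return system - baseline


def relative_change(baseline, system):
    """Difference expressed as a fraction of the baseline value."""
    return (system - baseline) / baseline


# Hypothetical counts for one event class (assumption: no substitutions,
# so every miss is a deletion and every false alarm is an insertion).
base_tp, base_fp, base_fn = 60, 25, 40
sys_tp, sys_fp, sys_fn = 75, 15, 25
n_ref = base_tp + base_fn  # number of reference events

base_er = error_rate(0, base_fn, base_fp, n_ref)
sys_er = error_rate(0, sys_fn, sys_fp, n_ref)
base_f = f_score(base_tp, base_fp, base_fn)
sys_f = f_score(sys_tp, sys_fp, sys_fn)

print(f"ER: {base_er:.2f} -> {sys_er:.2f}, "
      f"absolute change {absolute_change(base_er, sys_er):+.2f}")
print(f"F:  {base_f:.3f} -> {sys_f:.3f}, "
      f"relative change {relative_change(base_f, sys_f):+.1%}")
```

An absolute change subtracts the two metric values directly, while a relative change divides that difference by the baseline, so the same pair of systems can yield quite different headline percentages depending on which convention is used.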

Key words: sound event detection, multi-state transition model, deep neural network, transfer learning, multitask learning

CLC Number: TN912

WANG Jianfei, ZHANG Weiqiang, LIU Jia. Scene-independent sound event detection based on multi-state transition model[J]. , 2019, 36(2): 218-225.
