Scene-independent sound event detection based on multi-state transition model

Journal of University of Chinese Academy of Sciences, 2019, Vol. 36, Issue 2: 218-225. DOI: 10.7523/j.issn.2095-6134.2019.02.009

WANG Jianfei, ZHANG Weiqiang, LIU Jia

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Received: 2017-12-06 Revised: 2018-04-08 Published: 2019-03-15
Corresponding author: ZHANG Weiqiang
Funding: Supported by the National Natural Science Foundation of China (U1836219)

Abstract

Abstract: We design multi-state transition models for different types of sound events and combine them with two kinds of deep neural networks to improve the conventional sound event detection framework. Experiments on the DCASE2017 task 2 development dataset show that the improved DNN-HMM system achieves an 8.9% relative increase in F-score and a 19% absolute reduction in error rate (ER) over the baseline system. The state clustering convolutional neural network (SC-CNN) system, which is based on clustering of the multi-state transition model, achieves an 18% relative increase in F-score and a 30% absolute reduction in ER over the baseline, which reaches the international advanced level.

Key words: sound event detection, multi-state transition model, deep neural network, transfer learning, multitask learning
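
The abstract states F-score gains in relative terms and ER reductions in absolute terms, and the two are easy to confuse. The sketch below shows how each kind of change applies to a baseline score. The baseline values and the helper names `apply_relative_gain` and `apply_absolute_drop` are assumptions made for illustration only; they are not taken from the paper.

```python
def apply_relative_gain(baseline: float, relative_gain: float) -> float:
    """Return the score after a relative (multiplicative) improvement."""
    return baseline * (1.0 + relative_gain)


def apply_absolute_drop(baseline: float, absolute_drop: float) -> float:
    """Return the score after an absolute (subtractive) reduction."""
    return baseline - absolute_drop


# Hypothetical baseline scores, chosen only for illustration;
# they are NOT the baseline results reported in the paper.
baseline_f = 0.70   # F-score as a fraction
baseline_er = 0.50  # error rate as a fraction

# Changes stated in the abstract for the improved DNN-HMM system:
# 8.9% relative F-score gain, 19% absolute ER reduction.
improved_f = apply_relative_gain(baseline_f, 0.089)
improved_er = apply_absolute_drop(baseline_er, 0.19)

print(f"F-score: {baseline_f:.3f} -> {improved_f:.3f}")
print(f"ER:      {baseline_er:.3f} -> {improved_er:.3f}")
```

Under these assumed baselines, a relative gain scales with the starting F-score, whereas an absolute reduction subtracts the same amount from the ER regardless of where the baseline sits.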

CLC number: TN912

WANG Jianfei, ZHANG Weiqiang, LIU Jia. Scene-independent sound event detection based on multi-state transition model[J]. Journal of University of Chinese Academy of Sciences, 2019, 36(2): 218-225.

