Sample optimization based on local features in speech emotion recognition

doi:10.7523/j.issn.2095-6134.2017.04.004

Abstract

Emotion recognition is one of the most promising techniques in human-machine interaction. Most studies prefer statistical functional features because these features track how speech varies as emotion changes. However, local features, i.e., short-term or temporal features extracted from a single frame, also carry useful information. This work proposes a new approach that optimizes samples using local features. K-means clustering partitions the frames of each sample into two groups: frames with obvious emotion and frames whose emotion is less obvious. The cluster containing more frames is hypothesized to be the emotionally obvious one. Classification performs better on the optimized samples than on the original ones: tested on 3 corpora, the method raises classification accuracy by 5%-17%. The improvement also grows with speech length, which suggests that the optimization approach may be better suited to recognizing longer speech.
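The frame-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which frame-level features are used, how K-means is initialized, or how ties are broken, so the deterministic initialization, the tie rule, and the function names `two_means_labels` and `optimize_sample` are all assumptions made for this example.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one frame-level (local) feature vector


def _sq_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(frames: List[Frame]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def two_means_labels(frames: List[Frame], n_iter: int = 50) -> List[int]:
    """Label each frame 0 or 1 using plain Lloyd iterations with k = 2."""
    # Assumed initialization: the first frame and the frame farthest from it.
    c0 = list(frames[0])
    c1 = list(max(frames, key=lambda f: _sq_dist(f, c0)))
    labels: List[int] = []
    for _ in range(n_iter):
        new_labels = [0 if _sq_dist(f, c0) <= _sq_dist(f, c1) else 1
                      for f in frames]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        group0 = [f for f, lab in zip(frames, labels) if lab == 0]
        group1 = [f for f, lab in zip(frames, labels) if lab == 1]
        if group0:
            c0 = _mean(group0)
        if group1:
            c1 = _mean(group1)
    return labels


def optimize_sample(frames: List[Frame]) -> List[Frame]:
    """Keep only the frames in the larger of the two clusters.

    Follows the abstract's hypothesis that the cluster with more frames
    is the emotionally obvious one. Ties keep cluster 0 (an assumption).
    """
    if len(frames) < 2:
        return list(frames)
    labels = two_means_labels(frames)
    majority = 1 if sum(labels) * 2 > len(labels) else 0
    return [f for f, lab in zip(frames, labels) if lab == majority]


# Toy sample: three frames near the origin and two far away.
frames = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.0], [5.1, 4.9]]
kept = optimize_sample(frames)
print(len(kept), "of", len(frames), "frames kept")
```

In this toy sample the three frames near the origin form the larger cluster, so they are the ones retained. In a real pipeline the retained frames would presumably be passed on to feature extraction and classification in place of the full sample.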

Key words: speech emotion recognition, local features, global features, cluster analysis, sample optimization

CLC Number: TN391.4

SUI Xiaoyun, ZHU Tingshao, WANG Jingying. Sample optimization based on local features in speech emotion recognition[J]. , 2017, 34(4): 431-438.
