Two stream LSTM based on self-supervised learning for online action detection

doi:10.7523/j.ucas.2021.0049

Abstract

Abstract: Online action detection is important for applications such as security and human-computer interaction. The task requires a model to detect an action as soon as it starts, rather than waiting for the whole action to finish. Because an online detector can judge only from the part of the video observed so far, it must extract more information from the history to support its decision on the current frame than models for action recognition or offline action detection do. Building on the long short-term memory (LSTM) model commonly used for online action detection, this paper constructs a two-stream LSTM model (2S-LSTM) and brings self-supervised learning, which is widely used in the image domain, to the online action detection problem. First, 2S-LSTM uses LSTM to model the temporal information of the RGB stream and of the optical flow stream separately. Second, following the idea of self-supervised learning, two new loss functions, a temporal similarity loss and an optical flow stability loss, are constructed for training. Experiments show that, compared with the earlier online action detection methods RED, TRN, and IDN, the proposed model achieves better results on both the TVSeries and THUMOS’14 datasets.

Key words: self-supervised learning, two-stream LSTM network (2S-LSTM), online action detection, temporal similarity loss, optical flow stability loss
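
The abstract describes two LSTMs, one per stream, that produce a decision for each frame from the frames observed so far. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. Several things here are assumptions made for illustration: the class names `LSTMCell` and `TwoStreamLSTM`, the random untrained weights, the fusion of the two streams by concatenating their hidden states, and the linear softmax classifier. The abstract does not specify how the streams are fused, and the two self-supervised losses are not sketched because their definitions are not given here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, vec):
    """Return weights @ vec + bias for list-of-lists weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), candidate (g), output (o) gates."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        self.hidden_size = hidden_size
        # Random, untrained weights: for illustration only.
        self.W = {k: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for k in "ifgo"}
        self.b = {k: [0.0] * hidden_size for k in "ifgo"}

    def step(self, x, h, c):
        z = list(x) + list(h)  # gates see the current input and the previous hidden state
        i = [sigmoid(a) for a in affine(self.W["i"], self.b["i"], z)]
        f = [sigmoid(a) for a in affine(self.W["f"], self.b["f"], z)]
        g = [math.tanh(a) for a in affine(self.W["g"], self.b["g"], z)]
        o = [sigmoid(a) for a in affine(self.W["o"], self.b["o"], z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


class TwoStreamLSTM:
    """One LSTM per stream; fusion by concatenation is an assumption of this sketch."""

    def __init__(self, rgb_dim, flow_dim, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.rgb_cell = LSTMCell(rgb_dim, hidden_size, rng)
        self.flow_cell = LSTMCell(flow_dim, hidden_size, rng)
        self.W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]
                      for _ in range(num_classes)]
        self.b_out = [0.0] * num_classes

    def run_online(self, rgb_frames, flow_frames):
        """Return one class distribution per frame, using only past and current frames."""
        h_r, c_r = [0.0] * self.hidden_size, [0.0] * self.hidden_size
        h_f, c_f = [0.0] * self.hidden_size, [0.0] * self.hidden_size
        scores = []
        for x_rgb, x_flow in zip(rgb_frames, flow_frames):
            h_r, c_r = self.rgb_cell.step(x_rgb, h_r, c_r)
            h_f, c_f = self.flow_cell.step(x_flow, h_f, c_f)
            fused = h_r + h_f  # list concatenation of the two hidden states
            scores.append(softmax(affine(self.W_out, self.b_out, fused)))
        return scores
```

Calling `TwoStreamLSTM(4, 3, 5, 3).run_online(rgb, flow)` on two equally long lists of per-frame feature vectors (4-dimensional RGB features and 3-dimensional flow features in this hypothetical setup) returns one probability vector per frame. Because the state is updated strictly frame by frame, the output at frame t is unchanged when later frames are appended, which is the property an online detector needs.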

CLC number: TP181

ZHU Jiatong, QING Laiyun, HUANG Qingming. Two stream LSTM based on self-supervised learning for online action detection[J]. Journal of University of Chinese Academy of Sciences, 2022, 39(6): 827-835.
