Two stream LSTM based on self-supervised learning for online action detection

doi:10.7523/j.ucas.2021.0049

Abstract

Online action detection plays an important role in applications such as security and human-computer interaction. The task requires the system to detect an action as soon as it starts, rather than waiting for the entire action to finish. Because a model for online action detection can make judgments based only on the observed part of the video, it must extract more from historical information to support decisions for the current frame than models for related tasks such as action recognition and action detection. Building on the long short-term memory (LSTM) model commonly used in online action detection, this paper constructs a two-stream LSTM model, called 2S-LSTM, and introduces self-supervised learning, which is widely used in the image field, into online action detection. First, 2S-LSTM uses separate LSTMs to model the temporal information of the RGB stream and the optical flow stream. Second, based on the idea of self-supervised learning, we construct two new loss functions for training: a temporal similarity loss and an optical flow stability loss. Experiments show that, compared with previous online action detection methods such as RED, TRN, and IDN, our model achieves better results on both the TVSeries and THUMOS’14 datasets.
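The online, two-stream setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' 2S-LSTM: the class names `TinyLSTM` and `TwoStreamLSTM`, the feature sizes, the random untrained weights, the bias-free gates, and the fusion of the two streams by concatenation are all assumptions made for illustration. The two self-supervised losses are left out because the abstract does not give their form. The sketch only shows the key constraint of the task: a class distribution is produced at each frame from the frames seen so far.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class TinyLSTM:
    """Bias-free single-layer LSTM that consumes one frame feature per call."""

    def __init__(self, input_size, hidden_size, seed):
        rng = random.Random(seed)
        n = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.W = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                         for _ in range(hidden_size)] for gate in "ifog"}
        self.hidden_size = hidden_size
        self.h = [0.0] * hidden_size  # hidden state carried across frames
        self.c = [0.0] * hidden_size  # cell state carried across frames

    def step(self, x):
        z = list(x) + self.h  # current input concatenated with previous hidden state
        pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in self.W[gate]]
               for gate in "ifog"}
        for k in range(self.hidden_size):
            i = sigmoid(pre["i"][k])
            f = sigmoid(pre["f"][k])
            o = sigmoid(pre["o"][k])
            g = math.tanh(pre["g"][k])
            self.c[k] = f * self.c[k] + i * g
            self.h[k] = o * math.tanh(self.c[k])
        return list(self.h)


class TwoStreamLSTM:
    """Hypothetical two-stream model: one LSTM per modality, fused by concatenation."""

    def __init__(self, rgb_dim, flow_dim, hidden, num_classes, seed=0):
        self.rgb = TinyLSTM(rgb_dim, hidden, seed)
        self.flow = TinyLSTM(flow_dim, hidden, seed + 1)
        rng = random.Random(seed + 2)
        self.W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)]
                      for _ in range(num_classes)]

    def step(self, rgb_feat, flow_feat):
        # Assumed fusion: concatenate the hidden states of the two streams.
        fused = self.rgb.step(rgb_feat) + self.flow.step(flow_feat)
        logits = [sum(w * v for w, v in zip(row, fused)) for row in self.W_out]
        return softmax(logits)


# Online use: a class distribution is available at every frame,
# computed only from the frames observed so far.
model = TwoStreamLSTM(rgb_dim=4, flow_dim=3, hidden=5, num_classes=3)
rng = random.Random(42)
all_probs = []
for t in range(6):
    rgb_feat = [rng.random() for _ in range(4)]   # stand-in for per-frame RGB features
    flow_feat = [rng.random() for _ in range(3)]  # stand-in for per-frame optical flow features
    all_probs.append(model.step(rgb_feat, flow_feat))
```

Because each call to `step` reuses the hidden and cell states from earlier frames, the prediction at frame `t` depends on the whole observed history but never on future frames.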

Key words: self-supervised learning, two-stream LSTM networks (2S-LSTM), online action detection, temporal similarity loss, optical flow stability loss

CLC Number: TP181

ZHU Jiatong, QING Laiyun, HUANG Qingming. Two stream LSTM based on self-supervised learning for online action detection[J]. Journal of University of Chinese Academy of Sciences, 2022, 39(6): 827-835.

References

[1] Gao J Y, Yang Z H, Nevatia R. RED: reinforced encoder-decoder networks for action anticipation[C]//Proceedings of the British Machine Vision Conference 2017 (BMVC). September 4-7, 2017, London, UK. British Machine Vision Association, 2017: 92.1-92.11. DOI:10.5244/c.31.92.
[2] Xu M Z, Gao M F, Chen Y T, et al. Temporal recurrent networks for online action detection[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). October 27-November 2, 2019, Seoul, Korea (South). IEEE, 2019: 5531-5540. DOI:10.1109/ICCV.2019.00563.
[3] Eun H, Moon J, Park J, et al. Learning to discriminate information for online action detection[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 13-19, 2020, Seattle, WA, USA. IEEE, 2020: 806-815. DOI:10.1109/CVPR42600.2020.00089.
[4] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. DOI:10.1162/neco.1997.9.8.1735.
[5] Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis[EB/OL]. arXiv: 1412.0767v1. (2014-12-02) [2021-05-10]. https://doi.org/10.48550/arXiv.1412.0767.
[6] Jiang Y G, Liu J, Zamir Roshan A, et al. THUMOS challenge: action recognition with a large number of classes[EB/OL]. (2014-08-20) [2021-05-16]. http://crcv.ucf.edu/THUMOS14.
[7] De Geest R, Gavves E, Ghodrati A, et al. Online action detection[M]//Computer Vision-ECCV 2016. Cham: Springer International Publishing, 2016: 269-284. DOI:10.1007/978-3-319-46454-1_17.
[8] De Geest R, Tuytelaars T. Modeling temporal structure with LSTM for online action detection[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). March 12-15, 2018, Lake Tahoe, NV, USA. IEEE, 2018: 1549-1557. DOI:10.1109/WACV.2018.00173.
[9] Cho K, van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar. Stroudsburg, PA, USA: Association for Computational Linguistics, 2014: 1724-1734. DOI:10.3115/v1/d14-1179.
[10] Lee H Y, Huang J B, Singh M, et al. Unsupervised representation learning by sorting sequences[C]//2017 IEEE International Conference on Computer Vision (ICCV). October 22-29, 2017, Venice, Italy. IEEE, 2017: 667-676. DOI:10.1109/ICCV.2017.79.
[11] Luo D Z, Liu C, Zhou Y, et al. Video cloze procedure for self-supervised spatio-temporal learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11701-11708. DOI:10.1609/aaai.v34i07.6840.
[12] Xu D J, Xiao J, Zhao Z, et al. Self-supervised spatiotemporal learning via video clip order prediction[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 15-20, 2019, Long Beach, CA, USA. IEEE, 2019: 10326-10335. DOI:10.1109/CVPR.2019.01058.
[13] Kim D, Cho D, Kweon I S. Self-supervised video representation learning with space-time cubic puzzles[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33: 8545-8552. DOI:10.1609/aaai.v33i01.33018545.
[14] Jayaraman D, Grauman K. Slow and steady feature analysis: higher order temporal coherence in video[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 3852-3861. DOI:10.1109/CVPR.2016.418.
[15] Wang L M, Xiong Y J, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[EB/OL]. arXiv: 1608.00859. (2016-08-02) [2021-05-16]. https://doi.org/10.48550/arXiv.1608.00859.
[16] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering[M]//Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference. The MIT Press, 2002: 585-591. DOI:10.7551/mitpress/1120.003.0080.
[17] Heilbron F C, Escorcia V, Ghanem B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 7-12, 2015, Boston, MA, USA. IEEE, 2015: 961-970. DOI:10.1109/CVPR.2015.7298698.
[18] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 770-778. DOI:10.1109/CVPR.2016.90.
[19] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//2015 International Conference on Machine Learning (ICML). July 6-11, 2015, Lille, France. PMLR, 2015: 448-456.