Field dynamic small object detection network based on double frame fusion

doi:10.7523/j.ucas.2023.008

Abstract: Detecting dynamic small objects in complex field environments remains challenging for defense and military applications, because field surveillance sensing systems face strong background interference, small targets occupy few pixels, and relevant open datasets are lacking. To address this problem, a YOLOv5-based object detection network with double-frame feature fusion (YOLO-DFNet) is proposed. First, a double-frame feature fusion module (D-F fusion) processes the adjacent-frame features from the backbone network, computing attention successively along the channel, time, and space dimensions to extract motion features. Second, a temporal trapezoidal fusion network based on an attention mechanism (TTFN_AM) is placed between the neck network and the detection head to focus on dynamic objects within receptive fields of different sizes, thereby improving the detection of small objects with large displacement. Experimental results on the field motion small object dataset (FMSOD) show that the mean average precision (mAP) over different intersection-over-union (IoU) thresholds of the proposed YOLO-DFNet is 3.9 percentage points higher than that of YOLOv5, and that it also outperforms other object detection models such as TPH-YOLOv5 and YOLOv7.
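To illustrate why targets that occupy few pixels are hard to score well when mAP is averaged over several IoU thresholds, the following minimal sketch compares the same localization error on a small and a large box. It is not taken from the paper: the box sizes, the 2-pixel offset, the helper names (`iou`, `shifted`, `thresholds_passed`), and the COCO-style threshold grid 0.50 to 0.95 are all assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx, dy):
    """Return the box translated by (dx, dy) pixels."""
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


# Assumed COCO-style grid: 0.50, 0.55, ..., 0.95 (hypothetical for this paper).
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]


def thresholds_passed(pred, truth):
    """Count the IoU thresholds at which pred would match truth."""
    score = iou(pred, truth)
    return sum(1 for t in IOU_THRESHOLDS if score >= t)


# Hypothetical ground-truth boxes: an 8x8 small target and a 64x64 large one.
small_truth = (0, 0, 8, 8)
large_truth = (0, 0, 64, 64)

# The same 2-pixel localization error applied to both predictions.
small_pred = shifted(small_truth, 2, 2)
large_pred = shifted(large_truth, 2, 2)

print(round(iou(small_pred, small_truth), 3), thresholds_passed(small_pred, small_truth))
print(round(iou(large_pred, large_truth), 3), thresholds_passed(large_pred, large_truth))
```

Under these assumptions the 8x8 prediction reaches an IoU of about 0.391 and matches at none of the ten thresholds, while the 64x64 prediction reaches about 0.884 and matches at eight of them. The same pixel error therefore costs a small object far more across the threshold range.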

Keywords: object detection, field monitoring sensor network, dynamic small object, double-frame feature fusion, spatial-temporal attention

CLC Number: TP391.4

ZHAO Xiaohan, ZHANG Zebin, LI Baoqing. Field dynamic small object detection network based on double frame fusion[J]. Journal of University of Chinese Academy of Sciences, 2024, 41(6): 810-820.
