Field dynamic small object detection network based on double frame fusion

doi:10.7523/j.ucas.2023.008

Abstract

Abstract: Detecting dynamic small objects in complex field environments remains a challenging problem for defense and military applications, because field surveillance sensing systems suffer from heavy background interference, small targets occupy few pixels, and relevant open datasets are lacking. To address this problem, a YOLOv5-based object detection network with double frame feature fusion (YOLO-DFNet) is proposed. First, a double frame feature fusion module (D-F fusion) processes the adjacent-frame features output by the backbone network, computing attention over the channel and time dimensions and then over the space dimension to extract motion features. Second, a temporal trapezoidal fusion network based on an attention mechanism (TTFN_AM) is placed between the neck network and the detection head to focus on dynamic objects within receptive fields of different sizes, which improves the detection of small objects with large displacement. Experimental results on the field motion small object dataset (FMSOD) show that the mean average precision (mAP) of YOLO-DFNet over different IoU thresholds is 3.9 percentage points higher than that of YOLOv5, and that YOLO-DFNet also outperforms other object detection networks such as TPH-YOLOv5 and YOLOv7.

Key words: object detection, field monitoring sensor network, dynamic small object, double-frame feature fusion, spatial-temporal attention
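
The double frame fusion idea summarized in the abstract (attention over channel and time, followed by spatial attention, applied to adjacent-frame features) can be illustrated with a minimal NumPy sketch. This is not the authors' D-F fusion module: the function name `double_frame_fusion`, the parameter-free sigmoid gating, the absolute-difference motion cue, and the final additive fusion are all assumptions chosen only to make the data flow concrete. A learned implementation would presumably replace the fixed gates with trainable layers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def double_frame_fusion(feat_prev, feat_curr):
    """Fuse two adjacent-frame feature maps, each of shape (C, H, W).

    Hypothetical, parameter-free stand-in for a learned fusion module.
    Returns the fused features (C, H, W) and the spatial attention map (H, W).
    """
    frames = np.stack([feat_prev, feat_curr])        # (T=2, C, H, W)

    # Channel-and-time attention (assumed form): one weight per
    # (frame, channel) pair, from global average pooling over H and W.
    pooled = frames.mean(axis=(2, 3))                # (T, C)
    ct_weights = sigmoid(pooled - pooled.mean())     # (T, C), values in (0, 1)
    frames = frames * ct_weights[:, :, None, None]

    # Motion cue (assumed form): absolute difference of the reweighted frames.
    motion = np.abs(frames[1] - frames[0])           # (C, H, W)

    # Spatial attention (assumed form): pool the motion cue over channels
    # with mean and max, then squash to (0, 1).
    pooled_map = motion.mean(axis=0) + motion.max(axis=0)   # (H, W)
    spatial = sigmoid(pooled_map - pooled_map.mean())       # (H, W)

    # Current-frame features plus spatially gated motion features.
    fused = frames[1] + spatial[None, :, :] * motion
    return fused, spatial


# Toy input: a small bright target moves from (2, 2) to (5, 5).
feat_prev = np.full((4, 8, 8), 0.1)
feat_prev[:, 2, 2] = 1.0
feat_curr = np.full((4, 8, 8), 0.1)
feat_curr[:, 5, 5] = 1.0
fused, spatial = double_frame_fusion(feat_prev, feat_curr)
```

In this toy input the spatial attention map is larger at the old and new target positions than over the static background, which is the behavior a motion-oriented fusion step is meant to produce.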

CLC number: TP391.4

ZHAO Xiaohan, ZHANG Zebin, LI Baoqing. Field dynamic small object detection network based on double frame fusion[J]. Journal of University of Chinese Academy of Sciences, 2024, 41(6): 810-820.
