Object detection has long been one of the research hotspots in computer vision; its task is to return the category and bounding-box coordinates of one or more specific objects in a given image. With the rapid progress of neural network research, the birth of the R-CNN detector marked object detection's formal entry into the deep learning era, bringing great gains in both speed and accuracy over traditional algorithms. However, the scale problem remains a persistent difficulty even for deep learning methods: detection accuracy drops markedly for extremely large or extremely small objects, and in recent years many researchers have studied how to achieve better multi-scale object detection. Although a series of surveys have summarized and analyzed deep-learning-based detectors from the perspectives of algorithm pipeline, network structure, training strategy, and datasets, few have systematically reviewed multi-scale object detection. Therefore, this paper first reviews the founding of the two main streams of deep-learning-based detection, namely the two-stage algorithms represented by the R-CNN series and the one-stage algorithms represented by YOLO and SSD; it then focuses on the realization of multi-scale detection, explaining typical strategies such as image pyramids and in-network feature pyramids; finally, it summarizes the current state of multi-scale object detection and looks ahead to future research directions.
Object detection is an important computer vision task. It evolved from image classification; the difference is that instead of classifying a single type of object in an image, it must simultaneously classify and localize the multiple objects that may exist in one image, where classification means assigning a category label to each object and localization means determining the corner coordinates of its bounding rectangle. Object detection is therefore more challenging and has broader application prospects, such as autonomous driving, face recognition, pedestrian detection, and medical inspection. It also serves as a research foundation for more complex vision tasks such as image segmentation, image captioning, object tracking, and action recognition.
An object detection algorithm consists of three main steps: image feature extraction, region proposal generation, and proposal classification. Among them, feature extraction is the cornerstone of the whole pipeline. Traditional algorithms generally describe images with hand-crafted feature operators, such as SIFT features [
Deep-learning-based object detectors fall into two main streams: (1) two-stage algorithms represented by the R-CNN series; (2) one-stage algorithms represented by YOLO [
The scale variance of the objects within different images or one image in object detection task
Object detection comprises two subtasks, localization and classification, and the root of the scale problem is that as a convolutional neural network deepens, its ability to express abstract features grows stronger while shallow spatial information is relatively lost. Deep feature maps therefore cannot provide the fine-grained spatial information needed to localize objects precisely, and the semantic information of small objects is also gradually lost during downsampling. A general idea for tackling the scale problem is thus to build multi-scale feature representations. Commonly used approaches include: (1) image pyramids, running detection on the image at several resolutions in turn; (2) feature pyramids built inside the network through cross-layer connections between feature maps of different depths; (3) spatial pyramids built inside the network from parallel branches with different receptive fields. Beyond multi-scale feature representations, some researchers narrow the accuracy gap between scales at finer-grained points of the pipeline, including anchors, intersection over union (IoU), dynamic convolution, and bounding-box loss functions.
This survey is organized around deep-learning-based multi-scale object detection. Section 1 introduces the background, reviewing how the mainstream two-stage and one-stage detectors were established and how the scale problem arose. Section 2 covers multi-scale detection based on image pyramids. Section 3 covers in-network feature pyramids, built either by cross-layer connections or by parallel branches. Section 4 discusses strategies involving anchors, IoU, dynamic convolution, and bounding-box loss functions. Section 5 looks ahead to possible research directions and trends, and Section 6 concludes the paper.
This section briefly reviews the development of the two main streams of deep-learning-based detectors. Two-stage algorithms first generate a set of candidate regions that may contain objects, via heuristics or a convolutional neural network, and then classify and regress each region based on its features. One-stage algorithms skip proposal generation and use a single convolutional network to localize and classify all objects in the image directly. Each stream has its strengths: two-stage algorithms are generally more accurate, especially in localization, while one-stage algorithms are usually faster and better suited to the real-time requirements of practical applications.
The traditional detection pipeline consists of proposal generation, feature extraction, and object classification, and two-stage algorithms evolved step by step from this basis. In 2014, Girshick et al. [
A series of follow-up algorithms were proposed to address these drawbacks of R-CNN. He et al. [
Although Fast R-CNN successfully folded classification and regression into the neural network, it was still one step away from true end-to-end training: proposal generation remained completely separate. Traditional algorithms such as selective search generate candidate regions directly from low-level visual features and cannot learn from a specific dataset. Selective search is also very time-consuming, taking about 2 s per image on a CPU. Even the EdgeBoxes algorithm, which at the time achieved the best trade-off between proposal quality and speed [
Almost all two-stage detectors born after Faster R-CNN took it as their prototype. Dai et al. [
Although two-stage algorithms from Faster R-CNN onward achieved a complete end-to-end training pipeline, they were still far from truly meeting real-time requirements. Hence, the YOLO algorithm [
(2) Using the same convolutional neural network as a shared backbone and swapping the network head to perform classification, localization, and detection respectively, which made OverFeat 9 times faster than R-CNN at detection [
After YOLO, more one-stage detectors followed. Liu et al. [
After a comprehensive upgrade of the original YOLO, Redmon et al. [
Although this new generation of one-stage detectors generally enjoyed an absolute speed advantage, a non-negligible accuracy gap remained compared with the top two-stage detectors. Lin et al. [
After YOLOv2, Redmon et al. upgraded it once more and proposed YOLOv3 [
In recent years, under the anchor-free trend, besides algorithms that rethink and improve the traditional anchor strategy [, detectors that discard anchors entirely have also emerged as a new research direction.
To quantify object scale, the relative scale of an instance (a value between 0 and 1), called scale for short, is usually defined as the square root of the instance area (the number of pixels in its mask) divided by the image area. The situation where object scales differ greatly across images, or where multiple objects within one image differ greatly in size, is known as the scale problem; it has always been one of the core challenges affecting detection accuracy, even in the deep learning era. The MS COCO dataset [
(1) The dataset contains 80 object categories, with wide coverage and diverse scenes;
(2) Sorting all object instances in the dataset by scale yields the scale distribution curve shown in
The scale distribution curve of the instances among MS COCO detection dataset
(3) In the evaluation metrics, instances with area smaller than 32×32 pixels are regarded as small objects, those larger than 96×96 pixels as large objects, and the rest as medium objects. Besides overall precision and recall, the benchmark therefore also reports precision and recall separately for small, medium, and large objects, which gives an intuitive view of a model's detection ability at different scales.
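The two definitions above, relative scale and the MS COCO size buckets, can be sketched in a few lines of Python (a minimal illustration; the 32×32 and 96×96 thresholds are the MS COCO ones quoted in the text, while the helper names are ours):

```python
import math

def relative_scale(instance_area, image_width, image_height):
    """Relative scale: sqrt(instance area / image area), in (0, 1)."""
    return math.sqrt(instance_area / (image_width * image_height))

def coco_size_bucket(instance_area):
    """MS COCO size buckets: small < 32*32 <= medium <= 96*96 < large."""
    if instance_area < 32 * 32:
        return "small"
    if instance_area > 96 * 96:
        return "large"
    return "medium"

# A 20x20-pixel object in a 640x480 image is tiny in both senses.
print(round(relative_scale(20 * 20, 640, 480), 3))  # 0.036
print(coco_size_bucket(20 * 20))  # small
```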
Detection performance on the MS COCO TEST-DEV dataset
Algorithm | Backbone | Year | AP | AP50 | AP75 | APS | APM | APL |
Faster R-CNN | VGGNet-16 | 2015 | 21.9 | 42.7 | - | - | - | - |
SSD512* | VGGNet-16 | 2016 | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5 |
Faster R-CNN+++ | ResNet-101 | 2016 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
R-FCN | ResNet-101 | 2016 | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0 |
Faster R-CNN w FPN | ResNet-101 | 2017 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
YOLOv2 | DarkNet-19 | 2017 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
DSSD513 | ResNet-101 | 2017 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
Mask R-CNN | ResNet-101 | 2017 | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 |
RetinaNet500 | ResNet-101 | 2017 | 34.4 | 53.1 | 36.8 | 14.7 | 38.5 | 49.1 |
RetinaNet800 | ResNet-101 | 2017 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
Cascade R-CNN | ResNet-101 | 2018 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
PANet | ResNeXt-101 | 2018 | 47.4 | 67.2 | 51.8 | 30.1 | 51.7 | 60.0 |
YOLOv3 | DarkNet-53 | 2018 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9 |
Faster R-CNN w SNIP++ | ResNet-101-Deformable | 2018 | 44.4 | 66.2 | 44.9 | 27.3 | 47.4 | 56.9 |
Faster R-CNN w SNIPER++ | ResNet-101-Deformable | 2018 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 |
RFB Net512-E | VGGNet-16 | 2018 | 34.4 | 55.7 | 36.4 | 17.6 | 37.0 | 47.6 |
PFPNet-R512 | VGGNet-16 | 2018 | 35.2 | 57.6 | 37.9 | 18.7 | 38.6 | 45.9 |
Faster R-CNN w FPN | DetNet-59 | 2018 | 40.3 | 62.1 | 43.8 | 23.6 | 42.6 | 50.0 |
CornerNet | Hourglass-104 | 2018 | 40.6 | 56.4 | 43.2 | 19.1 | 42.8 | 54.3 |
SOD-MTGAN | ResNet-101 | 2018 | 41.4 | 63.2 | 45.4 | 24.7 | 44.2 | 52.6 |
STDN513 | DenseNet-169 | 2018 | 31.8 | 51.0 | 33.6 | 14.4 | 36.1 | 43.4 |
DES512 | VGGNet-16 | 2018 | 32.8 | 53.2 | 34.6 | 13.9 | 36.0 | 47.6 |
DCNv2 | ResNet-101-DeformableV2 | 2019 | 44.8 | 66.3 | 48.8 | 24.4 | 48.1 | 59.6 |
Grid R-CNN w FPN | ResNet-101 | 2019 | 41.5 | 60.9 | 44.5 | 23.3 | 44.9 | 53.1 |
TridentNet | ResNet-101 | 2019 | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 |
TridentNet*++ | ResNet-101-Deformable | 2019 | 48.4 | 69.7 | 53.5 | 31.8 | 51.3 | 60.3 |
GA-Faster-RCNN w FPN | ResNet-50 | 2019 | 39.8 | 59.2 | 43.5 | 21.8 | 42.6 | 50.7 |
FSAF | ResNet-101 | 2019 | 40.9 | 61.5 | 44.0 | 24.0 | 44.2 | 51.3 |
FCOS w FPN | ResNet-101 | 2019 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 |
CenterNet | Hourglass-104 | 2019 | 42.1 | 61.1 | 45.9 | 24.1 | 45.5 | 52.8 |
YOLOv3@800 w ASFF* | DarkNet-53 | 2019 | 43.9 | 64.1 | 49.2 | 27.0 | 46.6 | 53.4 |
Double-Head-Ext | ResNet-101 | 2020 | 42.3 | 62.8 | 46.3 | 23.9 | 44.9 | 54.3 |
Faster R-CNN w AugFPN | ResNet-101 | 2020 | 41.5 | 63.9 | 45.1 | 23.8 | 44.7 | 52.8 |
ATSS | ResNet-101-Deformable | 2020 | 46.3 | 64.7 | 50.4 | 27.7 | 49.8 | 58.4 |
TSD | ResNet-101 | 2020 | 43.2 | 64.0 | 46.9 | 24.0 | 46.3 | 55.8 |
D2Det w FPN | ResNet-101 | 2020 | 45.4 | 64.0 | 49.5 | 25.8 | 48.7 | 58.1 |
Dynamic R-CNN | ResNet-101 | 2020 | 42.0 | 60.7 | 45.9 | 22.7 | 44.3 | 54.3 |
YOLOv4 | CSPDarkNet-53 | 2020 | 43.5 | 65.7 | 47.3 | 26.7 | 46.7 | 53.3 |
The root cause of detectors' poor performance on datasets with large scale variance is that, as a convolutional neural network deepens, its ability to express abstract features strengthens while shallow spatial information is relatively lost. Taking ResNet-50-based Faster R-CNN as an example, when detecting an image containing objects at multiple scales, it can be seen in
Visualization of detection results and backbone features of ResNet-50-based Faster R-CNN
The same image shows the overall contour at low resolution and more details at high resolution; this is exactly the basic principle of the image pyramid. Long before detection entered the deep learning era, image pyramids were already a common means of improving accuracy, e.g., extracting features with a fixed-size sliding window over images of different scales. Since neural networks are in essence also feature perceivers, image pyramids apply to them equally well, as shown below in
Overview of the image-pyramid-based multi-scale object detection
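The image-pyramid idea can be sketched as follows (a toy illustration: `fake_detector` is a stand-in for a real CNN, and a real pipeline would resize actual pixel data at each level; the point is that boxes found at every scale are mapped back to original-image coordinates):

```python
def detect_multi_scale(image_size, detector, scales=(0.5, 1.0, 2.0)):
    """Run a detector on several rescaled versions of the image and
    map the resulting boxes back to original-image coordinates."""
    all_boxes = []
    for s in scales:
        w, h = int(image_size[0] * s), int(image_size[1] * s)
        for (x1, y1, x2, y2, score) in detector((w, h), s):
            # Undo the rescaling so boxes from every level are comparable.
            all_boxes.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return all_boxes

# Stand-in detector: pretends to find one box covering the top-left
# tenth of the image at every scale.
def fake_detector(size, scale):
    w, h = size
    return [(0.0, 0.0, w * 0.1, h * 0.1, 0.9)]

boxes = detect_multi_scale((640, 480), fake_detector)
print(len(boxes))  # one detection per pyramid level -> 3
```

After mapping back, the three boxes coincide, which is what allows a subsequent non-maximum suppression step to merge duplicates across pyramid levels.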
Hao et al. [
In essence, the scale proposal network offers guidance for building the image pyramid, adapting the number of levels and the resolution of each level to the specific image, which effectively improves detection efficiency. A point worth pondering is: why can a convolutional network with a fixed receptive field, which is ill-suited to multi-scale detection, nevertheless estimate the scale distribution of objects in an image? Based on the network's response maps, the authors' explanation is that for face detection, even with a limited receptive field that only sees, say, the eyes clearly, the scale of the whole face can still be estimated relative to the scale of the eyes. Face detection is special in this regard, however, and whether the idea still holds for generic detection with far richer object types requires further experimental verification.
Singh et al. [
Singh et al. later upgraded SNIP to SNIPER [
A line of work draws on the idea of attention and introduces zoom-in operations to focus on particular image regions, achieving multi-scale detection adaptively. The first to introduce zoom-in operations into deep-learning detection was AZ-Net, proposed by Lu et al. [
Gao et al. [
Early algorithms represented by R-CNN [
Given the layer-upon-layer structure of convolutional networks, deeper feature maps have larger receptive fields, so feature maps at different depths naturally form a multi-scale representation; the SSD algorithm [
To address the drawbacks of SSD, Lin et al. proposed the well-known feature pyramid network, FPN [
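FPN's top-down pathway can be sketched at the shape level with NumPy (an illustration of the published design, upsampling the deeper map and adding a 1×1-projected lateral map; the random arrays and random projections stand in for real CNN features and learned 1×1 convolutions):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(c_maps, out_channels=256, rng=np.random.default_rng(0)):
    """Build pyramid levels P_i from backbone maps C_i (deepest last).
    Each lateral 1x1 conv is simulated by a random channel projection."""
    p = None
    p_maps = []
    for c in reversed(c_maps):                     # deepest level first
        w = rng.standard_normal((out_channels, c.shape[0]))
        lateral = np.einsum('oc,chw->ohw', w, c)   # 1x1 conv
        p = lateral if p is None else lateral + upsample2x(p)
        p_maps.append(p)
    return p_maps[::-1]                            # shallow-to-deep order

# Fake backbone maps C3..C5 with strides 8, 16, 32 on a 64x64 input.
rng = np.random.default_rng(0)
c3, c4, c5 = (rng.standard_normal((16, s, s)) for s in (8, 4, 2))
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (256, 8, 8) (256, 4, 4) (256, 2, 2)
```

Every output level thus mixes deep semantics (propagated top-down) with shallow spatial detail (injected laterally), which is the fusion discussed in the text.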
To give an intuitive sense of the effect of the feature fusion proposed in FPN, the following
Comparison of visualizations of YOLOv3's shallow features with and without fusing deep features
The core idea of FPN is to fuse feature information from different depths inside the network; its strictly top-down, level-by-level fusion structure is open to question, however, and a series of algorithms have discussed and improved upon it. Liu et al. [
Overview of different ways to construct the in-network feature pyramids through cross-layer connections
Kong et al. [
Pang et al. [
Liu et al. [
Guo et al. [
Tan et al. [
Schemes such as PANet and Libra R-CNN all design the in-network feature pyramid by hand, whereas in 2019, Ghiasi et al. [
The works above all modify the feature-fusion scheme proposed by FPN; Li et al. [
Besides image pyramids and cross-layer fusion of feature maps at different depths, another way to build multi-scale representations is to design parallel branches with different parameters inside the network, each branch extracting feature maps at its own receptive field, thereby constructing a spatial pyramid. The concept of the spatial pyramid originated with Lazebnik et al. [
Overview of different ways to construct the in-network feature pyramids through parallel branches
Inspired by the SPP module of SPP-Net, Chen et al. [
Zhao et al. [
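The parallel-branch idea can be illustrated with a naive NumPy implementation of dilated 3×3 convolution, in the spirit of (though not identical to) the atrous-pyramid designs above: every branch reads the same input but samples it at a different dilation rate, so each has a different effective receptive field before the outputs are fused:

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """Naive 'same'-padded 3x3 convolution with dilation `rate` on a 2D map.
    Effective receptive field: (2 * rate + 1) x (2 * rate + 1)."""
    h, w = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * rate : i * rate + h,
                                     j * rate : j * rate + w]
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0  # mean filter as a stand-in for learned weights
# Three parallel branches with growing receptive fields, fused by summation.
branches = [dilated_conv3x3(x, k, r) for r in (1, 2, 4)]
fused = sum(branches)
print(fused.shape)  # (6, 6)
```

All branches keep the input resolution, so fusion is a simple element-wise sum; real modules would use learned kernels per branch and often concatenate instead of summing.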
SSD [
Although TridentNet's computational overhead does not reach that of an image pyramid, producing outputs from multiple branches is still rather time-consuming. To keep the algorithm practical, the authors additionally designed TridentNet Fast, which trains all objects on all branches but uses only the middle branch at inference, amounting to in-network multi-scale data augmentation; without introducing any extra computation, it improves accuracy over the baseline by 2.7 points, only 0.6 points below the original TridentNet. Although the essence of TridentNet is learning scale-invariant features, why can TridentNet Fast, which drops the SNIP training strategy, still approach TridentNet's results? The authors conjecture that weight sharing is responsible; the precise reason awaits further study.
Whether using image pyramids or building in-network feature pyramids, both exploit multi-scale features to address the scale problem in detection. Beyond this line of thought, researchers have also tackled the scale problem at finer-grained points of the detection pipeline, including anchors, dynamic convolution, and GAN-based feature reconstruction. This section surveys these strategies one by one.
To detect objects of different scales, early object detection, besides sliding a fixed-size window over every level of an image pyramid, could also slide windows of different sizes over the same image in turn. Ren et al. [
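Multi-scale anchors in the Faster R-CNN style can be sketched as follows (a generic illustration of scale × aspect-ratio anchor enumeration; the particular scale and ratio values are common defaults chosen for the example, not mandated by the text):

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate anchors centred at (cx, cy): one box per (scale, ratio),
    each with area scale**2 and height/width ratio `ratio`."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

a = make_anchors(100, 100)
print(len(a))  # 3 scales x 3 ratios = 9 anchors per location
```

Sliding this fixed set of boxes over every feature-map location replaces the multi-size sliding windows of traditional pipelines with a single dense enumeration.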
Precisely because of the influence of anchors, Ming et al. [
Quantitative comparison of different anchor strategies based on Faster R-CNN's face detection performance on the WIDER FACE validation set
Method | Prediction layers | Anchor stride | Group sampling | All | Easy | Medium | Hard |
FPN | P2–P5 | {4, 8, 16, 32} | No | 82.1 | 90.9 | 91.3 | 87.6 |
FPN-finest-stride | P2 | {4, 8, 16, 32} | No | 81.6 | 90.4 | 91.0 | 87.1 |
FPN-finest | P2 | 4 | No | 80.2 | 94.1 | 93.0 | 86.6 |
FPN-finest-sampling | P2 | 4 | Yes | 82.8 | 94.7 | 93.8 | 88.7 |
Although multi-scale anchor settings have become standard equipment for most detectors facing the scale problem, in recent years more and more researchers have recognized the inherent flaws of the anchor strategy. Taking FPN as an example, Zhu et al. [
Similarly, Wang et al. [
During detector training, positive and negative samples are usually determined by the IoU between predicted boxes and ground-truth labels, e.g., IoU above 0.5 counts as positive and below 0.3 as negative. Such thresholds are mainly empirical, however, and not necessarily optimal. A fixed IoU threshold is even less appropriate for multi-scale detection, because an equal coordinate offset causes a much larger drop in IoU for a small object than for a large one. To address this, Cai et al. [
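The claim about equal coordinate offsets can be checked numerically (a self-contained verification using the standard IoU computation; the two box sizes are arbitrary examples):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def shifted(box, dx):
    return (box[0] + dx, box[1], box[2] + dx, box[3])

small = (0, 0, 20, 20)     # 20x20 object
large = (0, 0, 200, 200)   # 200x200 object
# The same 5-pixel horizontal offset:
print(round(iou(small, shifted(small, 5)), 3))   # 0.6
print(round(iou(large, shifted(large, 5)), 3))   # 0.951
```

With a 0.7 positive threshold, the small box would already be discarded as a negative sample while the large box comfortably passes, which is exactly the scale bias the text describes.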
Traditional convolutional networks have an inherent limitation: kernel sizes are fixed and pooling scales are fixed, so the receptive field of every feature layer in the network is always fixed, which is unfavourable for perceiving objects of different scales. A series of methods have therefore tried to make the convolution operation dynamic. For example, dilated convolution [
The L1 and L2 norms are classic regression losses and can be used to regress bounding boxes in detection. However, the L1 loss converges slowly and has unstable solutions, while the L2 loss is sensitive to outliers and thus insufficiently robust. Therefore, Girshick [
Rezatofighi et al. [
Zheng et al. [
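GIoU can be written down compactly (a direct transcription of the published definition: IoU minus the fraction of the smallest enclosing box not covered by the union; the corresponding loss is then 1 − GIoU):

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes, in [-1, 1]."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both a and b.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c = cw * ch
    return iou - (c - union) / c

# Non-overlapping boxes: IoU is 0 regardless of distance, GIoU is not.
print(round(giou((0, 0, 1, 1), (2, 0, 3, 1)), 3))   # -0.333
print(round(giou((0, 0, 1, 1), (9, 0, 10, 1)), 3))  # -0.8
```

Unlike plain IoU, GIoU still provides a distance-sensitive gradient when the boxes do not overlap, which is what makes it usable as a regression loss.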
Quantitative comparison of different loss functions based on Faster R-CNN's detection performance on the MS COCO TEST-DEV dataset
Loss function | AP | AP75 | APS | APM | APL |
IoU loss | 37.93 | 40.79 | 21.58 | 40.82 | 50.14 |
GIoU loss | 38.02 | 41.11 | 21.45 | 41.06 | 50.21 |
DIoU loss | 38.09 | 41.11 | | 41.18 | 50.32 |
CIoU loss | | | 21.32 | | |
Object detection comprises the two parts of classification and localization. In the second stage, traditional algorithms such as Faster R-CNN generally extract features of each proposal through shared fully connected layers, then classify and regress on two separate branches. The soundness of this practice is debatable, however. Song et al. [
Lu et al. [
In the MS-CNN algorithm [
Beyond this, the emergence of generative adversarial networks also provides a new way to reconstruct the features of small objects. Li et al. [
In addition, data augmentation is likewise a feasible way to alleviate the scale problem; for instance, the YOLOv2 algorithm [
In detection, to improve a detector's overall performance, a model is usually pre-trained on an extra dataset and then fine-tuned on the target dataset, or the extra dataset directly joins a combined training run. However, Yu et al. [
Chen et al. [
CutMix and Mosaic data augmentation
This improves training efficiency and enhances model robustness. Bochkovskiy et al. [
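Mosaic-style stitching can be sketched as follows (a toy NumPy version that tiles four equally sized images into one canvas at half resolution; real implementations also remap the box annotations and randomize the split point and crops):

```python
import numpy as np

def mosaic4(images):
    """Stitch four equally sized (H, W, C) images into one canvas of the
    same size, each downscaled 2x by simple striding."""
    h, w, c = images[0].shape
    canvas = np.zeros((h, w, c), dtype=images[0].dtype)
    halves = [img[::2, ::2] for img in images]  # crude 2x downscale
    canvas[: h // 2, : w // 2] = halves[0]
    canvas[: h // 2, w // 2 :] = halves[1]
    canvas[h // 2 :, : w // 2] = halves[2]
    canvas[h // 2 :, w // 2 :] = halves[3]
    return canvas

imgs = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic4(imgs)
print(m.shape)                 # (8, 8, 3)
print(m[0, 0, 0], m[7, 7, 0])  # 0 3
```

Because every source image is shrunk before stitching, medium objects become small ones, which is one reason such augmentation helps small-object accuracy.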
Quantitative evaluation of Stitcher augmentation and corresponding training strategies based on Faster R-CNN's detection performance on the MS COCO TEST-DEV dataset
Architecture | Stitcher | Training schedule | AP | AP50 | AP75 | APS | APM | APL |
ResNet-50 Faster R-CNN w FPN | No | 1× | 36.7 | 58.4 | 39.6 | 21.1 | 39.8 | 48.1 |
 | | 2× | 37.7 | 59.2 | 41.0 | 21.6 | 40.6 | 49.6 |
 | | 4× | 37.3 | 58.1 | 40.1 | 20.3 | 39.6 | 50.1 |
 | | 6× | 35.6 | 55.9 | 38.4 | 19.8 | 37.7 | 47.6 |
 | Yes | 1× | 38.6 | 60.5 | 41.8 | 24.4 | 41.9 | 49.3 |
 | | 6× | 40.4 | 62.5 | 44.2 | 26.1 | 43.1 | 51.5 |
Multi-scale object detection has long been a research difficulty. Combining the existing solutions, this section summarizes several questions worth deeper exploration as directions for future research.
(1) "Fragmented" image pyramids. SNIP [
(2) Multi-scale detection in high-resolution images. High-resolution images usually do not lack detail on small objects; the difficulty is balancing accuracy against computational resources. Constrained by memory and detection-speed requirements, Faster R-CNN [
(3) Feature fusion under the feature-pyramid architecture. Since the feature pyramid network (FPN) [
(4) Neural architecture search. The recently emerging neural architecture search (NAS) methods have shown considerable potential on detection: the feature pyramid structures found by NAS [
(5) Can convolutional neural networks understand the concept of scale? The starting point of feature pyramid networks is to detect objects of different scales on feature maps of different scales; but this also means the network is effectively treating objects of different scales as different objects, even when they may really be the same object. Convolutional networks may therefore not truly understand scale, but merely memorize by brute force with their huge number of parameters. This may also be why algorithms adopting the FPN architecture generally lose some accuracy on large objects. TridentNet [
(6) The value of anchors. The early YOLO and DenseBox [
(7) IoU for multi-scale objects. Most existing detectors still determine positive and negative samples with fixed IoU thresholds during training. Although some researchers [
(8) Scale imbalance in datasets. Algorithms such as feature pyramid networks often give us the illusion that once a model allocates more resources to improving small-object accuracy, a drop in large-object accuracy is the inevitable consequence of limited learning capacity. However, the group sampling strategy proposed for anchor settings [
Against the background of deep-learning-based object detection, this paper first briefly reviewed how the mainstream algorithms took shape, including two-stage detectors such as R-CNN and one-stage detectors such as YOLO. It then summarized the performance of numerous recent detectors on the MS COCO detection dataset and, guided by the corresponding evaluation metrics, pointed out the great challenge posed by the scale problem, analyzing its root cause as the conflict between the shallow spatial information needed for localization and the deep semantic information needed for classification.
Oriented toward solving the scale problem, this paper collected and categorized the existing multi-scale detection strategies. Building multi-scale feature representations is the most typical and macroscopic strategy, divided into image pyramids and in-network feature pyramids. The former feeds multi-scale images into the network and steadily improves accuracy, but the markedly increased memory footprint and computation time are its main problems. The latter needs only the original image as input and, depending on how the pyramid is built, divides into cross-layer connections and parallel branches, at a lower computational cost than image pyramids. Beyond these, the paper also analyzed strategies that improve the scale problem at a finer-grained level, including anchors, IoU thresholds, dynamic convolution, and bounding-box loss functions, giving a more thorough understanding of the many design details of the detection pipeline.
Finally, based on the above analysis, the paper looked ahead to research directions for multi-scale detection: whether a "fragmented" image pyramid can resolve the computation-time problem, where the upper limit of stacked feature-fusion operations lies, whether convolutional networks can understand the concept of scale, and whether datasets suffer from overfitting and underfitting of objects at different scales. These questions all merit continued in-depth exploration.
Lowe DG. Distinctive image features from scale-invariant keypoints. Int'l Journal of Computer Vision, 2004, 60(2): 91-110.
Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proc. of the Computer Vision and Pattern Recognition. 2005, 1: 886-893.
Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proc. of the Neural Information Processing Systems. 2012. 1097-1105.
Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. Imagenet: A large-scale hierarchical image database. In: Proc. of the Computer Vision and Pattern Recognition. 2009. 248-255.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv Preprint arXiv: 1409.1556, 2014.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proc. of the Computer Vision and Pattern Recognition. 2015. 1-9.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc. of the Computer Vision and Pattern Recognition. 2016. 770-778.
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. of the Computer Vision and Pattern Recognition. 2014. 580-587.
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The pascal visual object classes (VoC) challenge. Int'l Journal of Computer Vision, 2010, 88(2): 303-338.
Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009, 32(9): 1627-1645.
Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R. A survey of deep learning-based object detection. IEEE Access, 2019, 7: 128837-128868.
Wu X, Sahoo D, Hoi SCH. Recent advances in deep learning for object detection. Neurocomputing, 2020.
Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M. Deep learning for generic object detection: A survey. Int'l Journal of Computer Vision, 2020, 128(2): 261-318.
Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2016. 779-788.
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC. SSD: Single shot multibox detector. In: Proc. of the European Conf. on Computer Vision. 2016. 21-37.
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft coco: Common objects in context. In: Proc. of the European Conf. on Computer Vision. 2014. 740-755.
Uijlings JRR, Van De Sande KEA, Gevers T, Smeulders AWM. Selective search for object recognition. Int'l Journal of Computer Vision, 2013, 104(2): 154-171.
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916.
Girshick R. Fast R-CNN. In: Proc. of the Int'l Conf. on Computer Vision. 2015. 1440-1448.
Zitnick CL, Dollár P. Edge boxes: Locating object proposals from edges. In: Proc. of the European Conf. on Computer Vision. 2014. 391-405.
Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. of the Neural Information Processing Systems. 2015. 91-99.
Dai J, Li Y, He K, Sun J. R-FCN: Object detection via region-based fully convolutional networks. In: Proc. of the Neural Information Processing Systems. 2016. 379-387.
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 2117-2125.
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proc. of the Int'l Conf. on Computer Vision. 2017. 2961-2969.
Qin Z, Li Z, Zhang Z, Bao Y, Yu G, Peng Y, Sun J. ThunderNet: Towards real-time generic object detection on mobile devices. In: Proc. of the Int'l Conf. on Computer Vision. 2019. 6718-6727.
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv Preprint arXiv: 1312.6229, 2013.
Redmon J, Farhadi A. YOLO9000: Better, faster, stronger. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 7263-7271.
Lin TY, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. In: Proc. of the Int'l Conf. on Computer Vision. 2017. 2980-2988.
Redmon J, Farhadi A. Yolov3: An incremental improvement. arXiv Preprint arXiv: 1804.02767, 2018.
Zhu C, He Y, Savvides M. Feature selective anchor-free module for single-shot object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 840-849.
Wang J, Chen K, Yang S, Loy CC, Lin D. Region proposal by guided anchoring. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 2965-2974.
Tian Z, Shen C, Chen H, He T. FCOS: Fully convolutional one-stage object detection. In: Proc. of the Int'l Conf. on Computer Vision. 2019. 9627-9636.
Law H, Deng J. Cornernet: Detecting objects as paired keypoints. In: Proc. of the European Conf. on Computer Vision. 2018. 734-750.
Zhou X, Wang D, Krähenbühl P. Objects as points. arXiv Preprint arXiv: 1904.07850, 2019.
Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. In: Proc. of the European Conf. on Computer Vision. 2016. 483-499.
Fu CY, Liu W, Ranga A, Tyagi A, Berg AC. DSSD: Deconvolutional single shot detector. arXiv Preprint arXiv: 1701.06659, 2017.
Cai Z, Vasconcelos N. Cascade R-CNN: Delving into high quality object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 6154-6162.
Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 8759-8768.
Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 1492-1500.
Singh B, Davis LS. An analysis of scale invariance in object detection snip. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 3578-3587.
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. Deformable convolutional networks. In: Proc. of the Int'l Conference on Computer Vision. 2017. 764-773.
Singh B, Najibi M, Davis LS. SNIPER: Efficient multi-scale training. In: Proc. of the Neural Information Processing Systems. 2018. 9310-9320.
Liu S, Huang D. Receptive field block net for accurate and fast object detection. In: Proc. of the European Conf. on Computer Vision. 2018. 385-400.
Kim SW, Kook HK, Sun JY, Kang MC, Ko SJ. Parallel feature pyramid network for object detection. In: Proc. of the European Conf. on Computer Vision. 2018. 234-250.
Li Z, Peng C, Yu G, Zhang X, Deng Y, Sun J. Detnet: A backbone network for object detection. arXiv Preprint arXiv: 1804.06215, 2018.
Bai Y, Zhang Y, Ding M, Ghanem B. SOD-MTGAN: Small object detection via multi-task generative adversarial network. In: Proc. of the European Conf. on Computer Vision. 2018. 206-221.
Zhou P, Ni B, Geng C, Hu J, Xu Y. Scale-transferrable object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 528-537.
Zhang Z, Qiao S, Xie C, Shen W, Wang Bo, Yuille AL. Single-shot object detection with enriched semantics. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 5813-5821.
Zhu X, Hu H, Lin S, Dai J. Deformable convnets v2: More deformable, better results. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 9308-9316.
Lu X, Li B, Yue Y, Li Q, Yan J. Grid R-CNN. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 7363-7372.
Li Y, Chen Y, Wang N, Zhang Z. Scale-aware trident networks for object detection. In: Proc. of the Int'l Conf. on Computer Vision. 2019. 6054-6063.
Liu S, Huang D, Wang Y. Learning spatial fusion for single-shot object detection. arXiv Preprint arXiv: 1911.09516, 2019.
Song G, Liu Y, Wang X. Revisiting the sibling head in object detector. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 11563-11572.
Guo C, Fan B, Zhang Q, Xiang S, Pan C. AUGFPN: Improving multi-scale feature learning for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 12595-12604.
Zhang S, Chi C, Yao Y, Lei Z, Li SZ. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 9759-9768.
Wu Y, Chen Y, Yuan L, Liu Z, Wang L, Li H, Fu Y. Rethinking classification and localization for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 10186-10195.
Cao J, Cholakkal H, Anwer RM, Khan FS, Peng Y, Shao L. D2Det: Towards high quality object detection and instance segmentation. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 11485-11494.
Zhang H, Chang H, Ma B, Wang N, Chen X. Dynamic R-CNN: Towards high quality object detection via dynamic training. arXiv Preprint arXiv: 2004.06002, 2020.
Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: Optimal speed and accuracy of object detection. arXiv Preprint arXiv: 2004. 10934, 2020.
Hao Z, Liu Y, Qin H, Yan J, Li X, Hu X. Scale-aware face detection. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 6186-6195.
Jain V, Learned-Miller E. FDDB: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010, 2(4).
Zhu X, Ramanan D. Face detection, pose estimation, and landmark localization in the wild. In: Proc. of the Computer Vision and Pattern Recognition. 2012. 2879-2886.
Yang B, Yan J, Lei Z, Li SZ. Fine-grained evaluation on face detection in the wild. In: Proc. of the Int'l Conf. and Workshops on Automatic Face and Gesture Recognition. 2015, 1: 1-7.
Lu Y, Javidi T, Lazebnik S. Adaptive object detection using adjacency and zoom prediction. In: Proc. of the Computer Vision and Pattern Recognition. 2016. 2351-2359.
Gao M, Yu R, Li A, Morariu VI, Davis LS. Dynamic zoom-in network for fast object detection in large images. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 6926-6935.
Dollár P, Wojek C, Schiele B, Perona P. Pedestrian detection: A benchmark. In: Proc. of the Computer Vision and Pattern Recognition. IEEE, 2009. 304-311.
Kalkowski S, Schulze C, Dengel A, Borth D. Real-time analysis and visualization of the YFCC100M dataset. In: Proc. of the Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions. 2015. 25-30.
Uzkent B, Yeh C, Ermon S. Efficient object detection in large images using deep reinforcement learning. In: Proc. of the Winter Conf. on Applications of Computer Vision. 2020. 1824-1833.
Lam D, Kuzma R, McGee K, Dooley S, Laielli M, Klaric M, Bulatov Y, McCord B. xview: Objects in context in overhead imagery. arXiv Preprint arXiv: 1802.07856, 2018.
Cai Z, Fan Q, Feris RS, Vasconcelos N. A unified multi-scale deep convolutional neural network for fast object detection. In: Proc. of the European Conf. on Computer Vision. 2016. 354-370.
Kong T, Sun F, Tan C, Liu H, Huang W. Deep feature pyramid reconfiguration for object detection. In: Proc. of the European Conf. on Computer Vision. 2018. 169-185.
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 7132-7141.
Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D. Libra R-CNN: Towards balanced learning for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 821-830.
Wang X, Girshick R, Gupta A, He K. Non-local neural networks. In: Proc. of the Computer Vision and Pattern Recognition. 2018. 7794-7803.
Tan M, Pang R, Le QV. Efficientdet: Scalable and efficient object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 10781-10790.
Tan M, Le QV. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv Preprint arXiv: 1905.11946, 2019.
Ghiasi G, Lin TY, Le QV. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 7036-7045.
Wang N, Gao Y, Chen H, Wang P, Tian Z, Shen C, Zhang Y. NAS-FCOS: Fast neural architecture search for object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 11943-11951.
Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. of the Computer Vision and Pattern Recognition. 2006, 2: 2169-2178.
Sivic J, Zisserman A. Video Google: A text retrieval approach to object matching in videos. In: Proc. of the Int'l Conf. on Computer Vision. 2003. 1470-1478.
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv Preprint arXiv: 1412.7062, 2014.
Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv Preprint arXiv: 1511.07122, 2015.
Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv Preprint arXiv: 1706.05587, 2017.
Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 2881-2890.
Ming X, Wei F, Zhang T, Chen D, Wen F. Group sampling for scale invariant face detection. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 3446-3456.
Yang S, Luo P, Loy CC, Tang X. Wider face: A face detection benchmark. In: Proc. of the Computer Vision and Pattern Recognition. 2016. 5525-5533.
Ke W, Zhang T, Huang Z, Ye Q, Liu J, Huang D. Multiple anchor learning for visual object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 10206-10215.
Chen Y, Dai X, Liu M, Chen D, Yuan L, Liu Z. Dynamic convolution: Attention over convolution kernels. In: Proc. of the Computer Vision and Pattern Recognition. 2020. 11030-11039.
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, Le QV, Adam H. Searching for mobilenetv3. In: Proc. of the Int'l Conf. on Computer Vision. 2019. 1314-1324.
Yu J, Jiang Y, Wang Z, Cao Z, Huang T. Unitbox: An advanced object detection network. In: Proc. of the ACM Int'l Conf. on Multimedia. 2016. 516-520.
Rezatofighi H, Tsoi N, Gwak JY, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: A metric and a loss for bounding box regression. In: Proc. of the Computer Vision and Pattern Recognition. 2019. 658-666.
Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proc. of the American Association for Artificial Intelligence. 2020. 12993-13000.
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 4700-4708.
Li J, Liang X, Wei Y, Xu T, Feng J, Yan S. Perceptual generative adversarial networks for small object detection. In: Proc. of the Computer Vision and Pattern Recognition. 2017. 1222-1230.
Zhu Z, Liang D, Zhang S, Huang X, Li B, Hu S. Traffic-sign detection and classification in the wild. In: Proc. of the Computer Vision and Pattern Recognition. 2016. 2110-2118.
Kisantal M, Wojna Z, Murawski J, Naruniec J, Cho K. Augmentation for small object detection. arXiv Preprint arXiv: 1902.07296, 2019.
Yu X, Gong Y, Jiang N, Ye Q, Han Z. Scale match for tiny person detection. In: Proc. of the Winter Conf. on Applications of Computer Vision. 2020. 1257-1265.
Chen Y, Zhang P, Li Z, Li Y, Zhang X, Meng G, Xiang S, Sun J, Jia J. Stitcher: Feedback-driven data provider for object detection. arXiv Preprint arXiv: 2004.12432, 2020.
Yun S, Han D, Oh SJ, Chun S, Choe J, Yoo Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In: Proc. of the Int'l Conf. on Computer Vision. 2019. 6023-6032.
Chen Y, Yang T, Zhang X, Meng G, Xiao X, Sun J. DetNAS: Backbone search for object detection. In: Proc. of the Neural Information Processing Systems. 2019. 6638-6648.
Huang L, Yang Y, Deng Y, Yu Y. Densebox: Unifying landmark localization with end to end object detection. arXiv Preprint arXiv: 1509.04874, 2015.