
YOLOv1论文中英文对照翻译

时间:2023-11-30 14:37:02

YOLOv1论文(附论文下载超链接):You Only Look Once: Unified, Real-Time Object Detection

声明:论文翻译仅用于学习,转载请注明来源。

You Only Look Once: Unified, Real-Time Object Detection

你只需看一次:统一的实时目标检测

Abstract

摘要

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

我们提出了YOLO,一种目标检测的新方法。之前的目标检测工作将分类器重新用于执行检测。相反,我们将目标检测视为一个回归问题,直接回归出空间上分离的边界框和相应的类别概率。一个单一的神经网络在一次评估中直接从完整图像预测出边界框和类别概率。由于整个检测流程是一个单一的网络,它可以直接针对检测性能进行端到端优化。

我们的统一架构非常快。我们的基础YOLO模型以每秒45帧的速度实时处理图像。该网络的一个较小版本Fast YOLO的处理速度达到惊人的每秒155帧,同时仍能达到其他实时检测器两倍的mAP。与最先进的检测系统相比,YOLO的定位错误较多,但在背景上预测出假阳性的可能性更小。最后,YOLO学习到非常通用的目标表征。当从自然图像泛化到艺术作品等其他领域时,它优于DPM和R-CNN等其他检测方法。

1. Introduction

1. 引言

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

人类看一眼图像,就能立即知道图像中有哪些物体、它们在哪里以及它们如何相互作用。人类的视觉系统快速而准确,使我们几乎不需要有意识的思考就能完成驾驶等复杂任务。快速准确的目标检测算法将使计算机能够在没有专用传感器的情况下驾驶汽车,使辅助设备能够向人类用户传达实时的场景信息,并释放通用、反应灵敏的机器人系统的潜力。

目前的检测系统将分类器重新用于执行检测。为了检测某个目标,这些系统使用该目标的分类器,并在测试图像的不同位置和不同尺度上对其进行评估。像可变形部件模型(DPM)这样的系统采用滑动窗口方法,在整个图像上均匀间隔的位置运行分类器[10]。

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

最近的方法如R-CNN使用区域推荐方法,首先在图像中生成潜在的边界框,然后在这些推荐的框上运行分类器。分类后,再用后处理来细化边界框、消除重复检测,并根据场景中的其他目标对这些框重新评分[13]。由于每个单独的组件都必须分开训练,这些复杂的流程既缓慢又难以优化。

我们将目标检测重新表述为一个单一的回归问题,直接从图像像素得到边界框坐标和类别概率。使用我们的系统,你只需对图像看一次(YOLO),就能预测出图像中有哪些目标以及它们在哪里。

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

**图1:YOLO检测系统。**用YOLO处理图像简单而直接。我们的系统(1)将输入图像的大小调整为448×448,(2)在图像上运行单个卷积网络,(3)根据模型的置信度对产生的检测结果进行阈值处理。

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

YOLO简单得令人耳目一新:见图1。一个卷积网络同时预测多个边界框和这些框的类别概率。YOLO在完整的图像上进行训练,直接优化检测性能。与传统的目标检测方法相比,这种统一的模型有几个好处。

首先,YOLO的速度非常快。由于我们把检测视为一个回归问题,我们不需要复杂的流程。我们只需在测试时对新图像运行我们的神经网络来预测检测结果。我们的基础网络在Titan X GPU上不使用批处理时以每秒45帧的速度运行,快速版本的运行速度超过每秒150帧。这意味着我们可以以不到25毫秒的延迟实时处理流媒体视频。此外,YOLO的平均精度均值(mAP)达到了其他实时系统的两倍以上。关于我们的系统在网络摄像头上实时运行的演示,请见我们的项目网页:http://pjreddie.com/yolo/。

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

第二,YOLO在进行预测时对图像进行全局推理。与滑动窗口和基于区域推荐的技术不同,YOLO在训练和测试期间能看到整幅图像,因此它隐含地编码了关于类别及其外观的上下文信息。Fast R-CNN是一种顶级的检测方法[14],但由于它看不到更大范围的上下文,会把图像中的背景块误认为目标。与Fast R-CNN相比,YOLO的背景错误数量不到其一半。

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

第三,YOLO学习到目标的可泛化表征。在自然图像上训练并在艺术作品上测试时,YOLO的性能远远超过DPM和R-CNN等顶级检测方法。由于YOLO具有很强的泛化能力,它在应用于新领域或遇到意外输入时不太容易失效。

YOLO在准确性方面仍然落后于最先进的检测系统。虽然它可以快速识别图像中的物体,但它在精确定位一些物体,特别是小物体方面却很困难。我们在实验中进一步研究这些权衡。

我们所有的训练和测试代码都是开源的。各种预训练的模型也可以下载。

2. Unified Detection

2. 统一检测

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

我们将目标检测的独立组件统一到一个神经网络中。我们的网络使用整个图像的特征来预测每个边界框。它还同时预测一个图像的所有类别的所有边界框。这意味着我们的网络对整个图像和图像中的所有目标进行全局推理。YOLO设计实现了端到端的训练和实时速度,同时保持了高平均精度。

我们的系统将输入图像划分为一个S×S网格。如果一个目标的中心落入一个网格单元,该网格单元就负责检测该物体。

每个网格单元预测B个边界框以及这些框的置信度分数。这些置信度分数反映了模型认为该框包含目标的把握有多大,也反映了它认为自己预测的框有多准确。形式上,我们将置信度定义为Pr(Object) ∗ IOU^truth_pred。如果该单元格中不存在目标,置信度分数应为零。否则,我们希望置信度分数等于预测框与真实框之间的交并比(IOU)。

每个边界框由5个预测值组成:x、y、w、h和置信度。(x, y)坐标代表框的中心相对于网格单元的边界。宽度和高度是相对于整个图像的预测。最后,置信度预测表示预测的框和任何真实框之间的IOU。
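To make this parametrization concrete, here is a minimal Python sketch (not from the paper; the helper name `encode_box` and the assumption that ground-truth boxes arrive as normalized center coordinates are ours) of how a box could be assigned to its responsible grid cell and converted to cell-relative targets:

```python
import numpy as np

def encode_box(box, S=7):
    """Encode a ground-truth box, given as normalized image coordinates
    (cx, cy, w, h) all in [0, 1], into YOLO-style grid-relative targets.

    Returns the responsible cell (row, col) and the (x, y, w, h) targets,
    where x, y are offsets inside that cell and w, h stay image-relative.
    """
    cx, cy, w, h = box
    col = min(int(cx * S), S - 1)   # grid column containing the box center
    row = min(int(cy * S), S - 1)   # grid row containing the box center
    x = cx * S - col                # center offset within the cell, in [0, 1]
    y = cy * S - row
    return (row, col), (x, y, w, h)

# Example: a box centered at (0.52, 0.40) covering 30% x 20% of the image
cell, target = encode_box((0.52, 0.40, 0.30, 0.20))
print(cell, target)   # -> (2, 3) and approximately (0.64, 0.80, 0.30, 0.20)
```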

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions.

\begin{equation}
\operatorname{Pr}\left(\text{Class}_{i} \mid \text{Object}\right) * \operatorname{Pr}(\text{Object}) * \mathrm{IOU}_{\text{pred}}^{\text{truth}} = \operatorname{Pr}\left(\text{Class}_{i}\right) * \mathrm{IOU}_{\text{pred}}^{\text{truth}}
\end{equation}
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

每个网格单元还预测C个条件类别概率Pr(Class_i | Object)。这些概率以该网格单元包含目标为条件。无论边界框的数量B是多少,我们都只为每个网格单元预测一组类别概率。

在测试时,我们将条件类概率和单个框的置信度预测相乘:
\begin{equation}
\operatorname{Pr}\left(\text{Class}_{i} \mid \text{Object}\right) * \operatorname{Pr}(\text{Object}) * \mathrm{IOU}_{\text{pred}}^{\text{truth}} = \operatorname{Pr}\left(\text{Class}_{i}\right) * \mathrm{IOU}_{\text{pred}}^{\text{truth}}
\end{equation}
这给我们提供了每个框的特定类别的置信度分数。这些分数既是对该类出现在框里的概率的编码,也是对预测的框与目标的匹配程度的编码。
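As an illustration of this multiplication, the following numpy sketch computes the class-specific scores for a single hypothetical grid cell; the B = 2 and C = 3 values and all of the probabilities are made up for the example:

```python
import numpy as np

# A hypothetical single grid cell with B = 2 boxes and C = 3 classes.
box_confidence = np.array([0.8, 0.1])      # Pr(Object) * IOU for each box
class_probs = np.array([0.6, 0.3, 0.1])    # Pr(Class_i | Object), shared by the cell

# Class-specific confidence scores: one row per box, one column per class.
# scores[j, i] = Pr(Class_i | Object) * Pr(Object) * IOU  (Equation 1)
scores = box_confidence[:, None] * class_probs[None, :]
print(scores)
# [[0.48 0.24 0.08]
#  [0.06 0.03 0.01]]
```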

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

**图2:模型。**我们的系统将检测建模为一个回归问题。它将图像划分为一个S×S的网格,并为每个网格单元预测B个边界框、这些框的置信度和C类概率。这些预测被编码为一个S×S×(B∗5+C)张量。

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

为了在PASCAL VOC上评估YOLO,我们使用S=7,B=2。PASCAL VOC有20个标注类别,所以C=20。我们的最终预测是一个7×7×30的张量。
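Below is a rough sketch of how such a 7 × 7 × 30 prediction could be sliced and thresholded, as in step (3) of Figure 1. The channel ordering is our assumption, since the paper does not specify the exact memory layout, and the 0.2 threshold is purely illustrative:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for one 7 x 7 x 30 network output

# Assumed layout: the first B*5 = 10 channels hold (x, y, w, h, confidence)
# for each of the B boxes; the remaining C = 20 channels hold the
# conditional class probabilities shared by the cell.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]          # shape (S, S, C)

# Class-specific scores per box (Equation 1), then threshold them by confidence.
scores = boxes[..., 4:5] * class_probs[:, :, None, :]   # shape (S, S, B, C)
keep = scores > 0.2                                      # illustrative threshold
print(scores.shape, int(keep.sum()))
```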

2.1. Network Design

2.1. 网络设计

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

我们将这个模型实现为卷积神经网络,并在PASCAL VOC检测数据集[9]上对其进行评估。网络的初始卷积层从图像中提取特征,而全连接层则预测输出概率和坐标。

我们的网络架构受到用于图像分类的GoogLeNet模型的启发[34]。我们的网络有24个卷积层,后接2个全连接层。我们没有使用GoogLeNet的inception模块,而是简单地使用1×1降维层后接3×3卷积层,与Lin等人[22]的做法类似。完整的网络如图3所示。

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

**图3:架构。**我们的检测网络有24个卷积层,然后是2个全连接层。交替出现的1×1卷积层减少了前几层的特征空间。我们在ImageNet分类任务中以一半的分辨率(224×224输入图像)对卷积层进行预训练,然后以两倍的分辨率进行检测。

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

我们还训练了一个快速版本的YOLO,旨在推动快速目标检测的边界。快速YOLO使用的神经网络的卷积层较少(9层而不是24层),这些层中的滤波器也较少。除了网络的大小,所有的训练和测试参数在YOLO和Fast YOLO之间都是一样的。

我们网络的最终输出是7×7×30的预测张量。
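The sketch below illustrates the design pattern only, not the actual 24-layer network: a 1 × 1 reduction layer followed by a 3 × 3 convolution, leaky ReLU activations, and fully connected layers that produce the S × S × (B ∗ 5 + C) output. Layer counts and filter widths here are placeholders chosen so the example runs quickly:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

class TinyYOLOSketch(nn.Module):
    """Heavily reduced sketch of the Figure 3 design pattern (not the paper's network)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 32, 1), nn.LeakyReLU(0.1),        # 1x1 reduction layer
            nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(7),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.LeakyReLU(0.1),
            nn.Linear(512, S * S * (B * 5 + C)),            # final detection tensor
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, S, S, B * 5 + C)

out = TinyYOLOSketch()(torch.zeros(1, 3, 448, 448))
print(out.shape)   # torch.Size([1, 7, 7, 30])
```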

2.2. Training

2.2. 训练

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
\begin{equation}
\phi(x)= \begin{cases} x, & \text{ if } x>0 \\ 0.1x, & \text{ otherwise} \end{cases}
\end{equation}
我们在ImageNet 1000类竞赛数据集[30]上预训练卷积层。在预训练时,我们使用图3中的前20个卷积层,后接一个平均池化层和一个全连接层。我们对这个网络训练了大约一周,在ImageNet 2012验证集上取得了88%的单裁剪(single crop)top-5准确率,与Caffe的Model Zoo[24]中的GoogLeNet模型相当。我们使用Darknet框架进行所有的训练和推理[26]。

然后我们转换模型来进行检测。Ren等人的研究表明,在预训练的网络中同时添加卷积层和连接层可以提高性能[29]。按照他们的例子,我们添加了四个卷积层和两个全连接层,权重随机初始化。检测通常需要精细的视觉信息,因此我们将网络的输入分辨率从224×224提高到448×448。

我们的最后一层同时预测类别概率和边界框坐标。我们用图像的宽度和高度对边界框的宽度和高度进行归一化,使它们落在0和1之间。我们将边界框的x和y坐标参数化为相对于特定网格单元位置的偏移量,因此它们也被限定在0和1之间。

我们在最后一层使用线性激活函数,所有其他层都使用下面的带泄漏的修正线性激活(leaky ReLU):
\begin{equation}
\phi(x)= \begin{cases} x, & \text{ if } x>0 \\ 0.1x, & \text{ otherwise} \end{cases}
\end{equation}
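Equation 2 translates directly into a couple of lines of numpy (a sketch, not the Darknet implementation):

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation from Equation 2: x if x > 0, else 0.1 * x."""
    return np.where(x > 0, x, slope * x)

print(leaky_relu(np.array([-2.0, 0.5])))   # [-0.2  0.5]
```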
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal.

Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

我们对模型输出的平方和误差进行优化。我们使用平方和误差是因为它易于优化,然而它并不完全符合我们最大化平均精度的目标。它对定位误差和分类误差赋予相同的权重,这可能并不理想。

另外,在每张图像中,许多网格单元不包含任何目标。这会把这些单元的"置信度"分数推向零,往往会压过包含目标的单元所产生的梯度。这可能导致模型不稳定,使训练在早期就发散。

为了解决这个问题,我们增加了边界框坐标预测的损失,减少了不包含物体的框的置信度预测的损失。我们使用两个参数,λcoord和λnoobj来实现这一目标。我们设定λcoord = 5,λnoobj = 0.5。

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

平方和误差对大框和小框中的误差也赋予相同的权重。我们的误差度量应该反映出,大框中的小偏差没有小框中的小偏差那么重要。为了部分解决这个问题,我们预测边界框宽度和高度的平方根,而不是直接预测宽度和高度。

YOLO为每个网格单元预测多个边界框。在训练时,我们只希望每个目标由一个边界框预测器负责。我们根据哪个预测器的预测与真实框(ground truth)的当前IOU最高,指定该预测器"负责"预测这个目标。这会导致边界框预测器之间的分工:每个预测器会更擅长预测特定尺寸、长宽比或类别的目标,从而提高整体召回率。
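A small sketch of this "responsible predictor" assignment, assuming boxes are given as corner coordinates; the `iou` helper is our own simplification and does not guard against zero-area boxes:

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The predictor whose box currently has the highest IOU with the ground
# truth is marked "responsible" for that object during training.
predicted = [(0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.8, 0.8)]
ground_truth = (0.25, 0.25, 0.75, 0.75)
responsible = int(np.argmax([iou(p, ground_truth) for p in predicted]))
print(responsible)   # 1
```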

During training we optimize the following, multi-part loss function:

在训练过程中,我们优化以下的多部分损失函数:
\begin{equation}
\begin{gathered}
\lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\
+\lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
+\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
+\lambda_{\text{noobj}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
+\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{gathered}
\end{equation}
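The following numpy sketch mirrors the structure of this loss under assumed tensor shapes. It is illustrative only; in particular, the indicator masks (1^obj and 1^noobj) are taken as given inputs rather than derived from the highest-IOU assignment described above:

```python
import numpy as np

def yolo_loss(pred_boxes, true_boxes, pred_conf, true_conf, obj_mask,
              pred_cls, true_cls, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of Equation 3 with assumed shapes:
    pred_boxes/true_boxes: (S*S, B, 4) as (x, y, w, h) with w, h >= 0;
    pred_conf/true_conf: (S*S, B); obj_mask: (S*S, B), 1 where a predictor
    is responsible for an object; pred_cls/true_cls: (S*S, C). A cell
    "contains" an object if any of its predictors is responsible.
    """
    noobj_mask = 1.0 - obj_mask
    cell_has_obj = obj_mask.max(axis=1)                    # shape (S*S,)

    xy_loss = (obj_mask * ((pred_boxes[..., 0] - true_boxes[..., 0]) ** 2 +
                           (pred_boxes[..., 1] - true_boxes[..., 1]) ** 2)).sum()
    wh_loss = (obj_mask * ((np.sqrt(pred_boxes[..., 2]) - np.sqrt(true_boxes[..., 2])) ** 2 +
                           (np.sqrt(pred_boxes[..., 3]) - np.sqrt(true_boxes[..., 3])) ** 2)).sum()
    conf_loss = (obj_mask * (pred_conf - true_conf) ** 2).sum()
    noobj_loss = (noobj_mask * (pred_conf - true_conf) ** 2).sum()
    cls_loss = (cell_has_obj[:, None] * (pred_cls - true_cls) ** 2).sum()

    return (lambda_coord * (xy_loss + wh_loss) + conf_loss +
            lambda_noobj * noobj_loss + cls_loss)
```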