
【論文翻譯】ResNet論文中英對照翻譯--(Deep Residual Learning for Image Recognition)

【開始時間】2018.10.03

【完成時間】2018.10.05

【論文翻譯】ResNet論文中英對照翻譯--(Deep Residual Learning for Image Recognition)

【中文譯名】深度殘差學習在影象識別中的應用

【論文連結】https://arxiv.org/pdf/1512.03385.pdf

 

【補充】

1)ResNet Github參考:https://github.com/tornadomeet/ResNet

2)NIN的第一個N指mlpconv,第二個N指整個深度網路結構,即整個深度網路是由多個mlpconv構成的。

3)論文的發表時間是:

10 Dec 2015。ResNet是2015年ILSVRC競賽中ImageNet classification任務上的冠軍模型。

【宣告】本文是本人根據原論文進行翻譯,有些地方加上了自己的理解,有些專有名詞用了最常用的譯法,時間匆忙,如有遺漏及錯誤,望各位包涵並指正。

  

                                          題目:

深度殘差學習在影象識別中的應用

Abstract(摘要)

    Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

    更深層次的神經網路更難訓練。我們提出了一個殘差學習框架,以簡化比以前使用的網路深得多的網路的訓練。我們根據層輸入顯式地將層重新表示為學習殘差函式(learning residual functions),而不是學習無參考的函式。我們提供了全面的經驗證據,表明這些殘差網路易於優化,並且可以從大幅度增加的深度中獲得精度。在ImageNet資料集上,我們評估了深度高達152層的殘差網路--是VGG網路[41]的8倍深,但仍然具有較低的複雜性。這些殘差網路的集合(ensemble)在ImageNet測試集上的誤差為3.57%。這個結果獲得了ILSVRC 2015分類任務的第一名。我們還在CIFAR-10資料集上分析了100層和1000層的網路。

    

    The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions 1 , where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

    表示的深度對於許多視覺識別任務是至關重要的。僅憑我們極深的表示,我們就在COCO目標檢測資料集上獲得了28%的相對提升。深度殘差網路是我們提交到ILSVRC & COCO 2015競賽的模型的基礎,我們在ImageNet檢測、ImageNet定位、COCO檢測以及COCO分割任務上均獲得了第一名的成績。

 

1. Introduction(介紹)

    Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

    深層卷積神經網路[22,21]帶來了影象分類[21,50,40]的一系列突破。深層網路自然地將低/中/高層次特徵[50]和分類器以端到端的多層方式整合在一起,而特徵的“層次”可以通過堆疊層的數量(深度)來豐富。最近的證據[41,44]表明,網路深度至關重要,在富有挑戰性的ImageNet資料集[36]上的領先結果[41,44,13,16]都利用了“非常深”[41]的模型,深度為16[41]至30[16]。許多其他非平凡(nontrivial)的視覺識別任務[8,12,7,32,27]也從非常深的模型中獲益良多。

 

    Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

    在深度重要性的驅動下,出現了一個問題:學習更好的網路是否就像堆疊更多的層一樣容易?回答這個問題的一個障礙是臭名昭著的梯度消失/爆炸問題[1,9],它從一開始就阻礙了收斂(hamper convergence)。然而,這個問題在很大程度上已經通過歸一化初始化[23,9,37,13]和中間歸一化層[16]得到解決,這使得具有數十層的網路能夠開始在帶反向傳播的隨機梯度下降(SGD)下收斂。

 

    When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

    當更深的網路能夠開始收斂時,一個退化的問題就暴露出來了:隨著網路深度的增加,精確度變得飽和(這可能不足為奇),然後迅速退化。出乎意料的是,這種退化並不是由於過度擬合造成的,而且在適當深度的模型中增加更多的層會導致更高的訓練誤差,正如[11,42]中所報告的,並通過我們的實驗進行了徹底驗證。圖1顯示了一個典型的例子。

圖1、20層和56層“樸素”網路的CIFAR-10上的訓練錯誤(左)和測試錯誤(右)。網路越深,訓練誤差越大,測試誤差越大。圖4中給出了ImageNet上的類似現象

 

    The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

   (訓練精度的)退化表明,並非所有系統都同樣容易優化。讓我們考慮一種較淺的體系結構,以及在其上增加更多層得到的較深的對應結構。對於更深的模型,存在一種通過構造得到的解:新增的層是恆等對映(identity mapping),其它層直接從已學習的淺層模型中複製而來。該構造解的存在表明,更深層次的模型不應比較淺的模型產生更高的訓練誤差。但是實驗表明,我們目前手頭的求解器無法找到與這種構造解相當或更好的解(或者無法在可行的時間內做到)。

 

   In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

   在本文中,我們通過引入深度殘差學習框架(a deep residual learning framework)來解決退化問題。我們不是希望每幾個堆疊的層直接擬合所需的底層對映(desired underlying mapping),而是顯式地讓這些層擬合一個殘差對映(residual mapping)。假設所需的底層對映為 H(x),我們讓堆疊的非線性層來擬合另一個對映:F(x) := H(x) − x。因此原來的對映轉化為:F(x) + x。我們假設優化殘差對映比優化原始的未參考的對映容易。在極端情況下,如果恆等對映是最優的,則將殘差推至零比用一堆非線性層擬合恆等對映更容易。

 

    The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

    公式 F(x)+x 可以通過前饋神經網路( feedforward neural networks )的“快捷連線(shortcut connections)”來實現(圖2)。捷徑連線[2,34,49]是跳過一個或多個層的連線。在本例中,快捷連線只執行恆等對映,它們的輸出被新增到疊加層的輸出中(圖2)。恆等捷徑連線既不增加額外的引數,也不增加計算的複雜性。整個網路仍然可以使用反向傳播的SGD進行端到端的訓練,並且可以使用公共庫(例如caffe[19])來實現,而無需修改求解器( solvers)。

                                    圖2、殘差學習:一個積木塊
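【譯者注】下面是一個極簡的示意程式碼(基於PyTorch,並非論文原作者的Caffe實現),用於說明上文所述的恆等快捷連線如何在前向計算中把疊加層的輸出與輸入相加;其中的類名、通道數等均為示意性假設:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """示意:兩個3x3卷積構成殘差函式F(x),其輸出與恆等快捷連線相加。"""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(out + x)  # F(x) + x,相加之後再做非線性

# 用法示例:恆等快捷連線不引入任何額外引數
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
```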

   We present comprehensive experiments on ImageNet[36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

   我們在ImageNet[36]上進行了綜合實驗,以說明退化問題並對我們的方法進行評估。結果表明:1)我們的極深殘差網路易於優化,但對應的“樸素”網(即簡單的層疊層)隨著深度的增加,訓練誤差較大;2)我們的深層殘差網可以很容易地從深度的大幅度增加中獲得精度增益,比以前的網路產生的結果要好得多。

    

     Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

     CIFAR-10資料集上也出現了類似的現象,這表明了我們提出的方法的優化難度和效果並不僅僅是對於一個特定資料集而言的。我們在這個資料集上展示了經過成功訓練的100層以上的模型,並探索了1000層以上的模型。

 

    On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

    在ImageNet分類資料集[36]上,我們利用極深的殘差網路得到了很好的結果。我們的152層殘差網路是迄今為止在ImageNet上出現的最深的網路,但其複雜度仍然低於VGG網路[41]。我們的模型集合在ImageNet測試集上的top-5錯誤率為3.57%,並在ILSVRC 2015分類競賽中獲得了第一名。這種極深的表示在其他識別任務上也有很好的泛化能力,使我們在ILSVRC & COCO 2015競賽中的ImageNet檢測、ImageNet定位、COCO檢測和COCO分割任務上進一步獲得了第一名。這一強有力的證據表明,殘差學習原理是通用的,我們期望它適用於其他視覺和非視覺問題。

 

2. Related Work(相關工作)

    Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

    殘差表示。在影象識別中,VLAD[18]是一種相對於字典用殘差向量進行編碼的表示,Fisher向量[30]可以表示為VLAD的概率版本[18]。兩者都是用於影象檢索和分類的有力的淺層表示[4,48]。對於向量量化,編碼殘差向量[17]比編碼原始向量更有效。

 

   In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

   在低層視覺(low-level vision)和計算機圖形學中,對於求解偏微分方程(PDEs),廣泛使用的多重網格方法[3]在多個尺度上將系統重新表述為子問題,其中每個子問題負責較粗和較細尺度之間的殘差解(residual solution)。多重網格的另一種選擇是分層基預處理(hierarchical basis preconditioning)[45,46],它依賴於表示兩個尺度之間殘差向量的變數。[3,45,46]已證明這些求解器比不瞭解解的殘差性質的標準求解器收斂得快得多。這些方法表明,良好的重新表述或預處理可以簡化優化過程。

 

    Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.

    捷徑連線。捷徑連線[2, 34, 49] 已經經過了很長的一段實踐和理論研究過程。一個訓練多層感知器(MLPs)的早期實踐是新增一個線性層,從網路輸入連線到輸出[34,49]。在[44,24]中,一些中間層直接連線到輔助分類器,用於解決消失/爆炸梯度(的問題)。[39,38,31,47]的論文提出了用捷徑連線實現集中層響應、梯度和傳播誤差的方法。在[44]中,“Inception”層是由一個捷徑分支和幾個更深的分支組成的。

 

     Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

     與我們的工作同時,“高速公路網路”(highway networks)[42,43]提出了帶有門控函式[15]的捷徑連線。這些門依賴於資料並且帶有引數,而我們的恆等快捷連線(identity shortcuts)是無引數的。當門控捷徑“關閉”(接近於零)時,高速公路網路中的層表示非殘差函式。相反,我們的公式總是學習殘差函式;我們的恆等快捷連線永遠不會關閉,所有資訊總是被傳遞下去,同時還要學習附加的殘差函式。此外,高速公路網路在深度極深(例如超過100層)的情況下,沒有表現出精度的提升。

 

3. Deep Residual Learning(深度殘差學習)

3.1. Residual Learning(殘差學習)

     Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions 2 , then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x)+x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

    讓我們把H(x)看作是由幾個堆疊的層(不一定是整個網路)要擬合的底層對映,用x表示這些層中第一層的輸入。如果假設多個非線性層可以漸近逼近複雜函式【2--然而,這一假設仍然是一個開放性問題,見[28]。】,那就等價於假設它們可以漸近逼近殘差函式,即 H(x) − x(假設輸入和輸出具有相同的維度)。因此,與其期望堆疊層近似H(x),我們不如顯式地讓這些層近似一個殘差函式 F(x) := H(x) − x。原來的函式因此變成 F(x) + x。雖然這兩種形式都應該能夠漸近地逼近所期望的函式(如假設的那樣),但學習的容易程度可能不同。

   

    This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

    此重新表述(reformulation)是由有關退化問題的反直覺現象所驅動的(圖1,左)。正如我們在引言中所討論的,如果可以將新增的層構造為恆等對映,則更深層次的模型的訓練誤差應該不大於其較淺的對應模型。退化問題表明,求解器可能難以用多個非線性層來逼近恆等對映。利用殘差學習的重新表述,如果恆等對映是最優的,求解器可以簡單地將多個非線性層的權值推向零,以逼近恆等對映。

     In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

     在實際情況下,恆等對映不太可能是最優的,但是我們的重新表述可能有助於對這個問題進行預處理。如果最優函式更接近於恆等對映而不是零對映,那麼求解器參照恆等對映來尋找擾動(perturbations),應該比把該函式作為一個全新的函式來學習更容易。我們通過實驗(圖7)證明,學習到的殘差函式一般都有較小的響應,說明恆等對映提供了合理的預處理。

 

3.2. Identity Mapping by Shortcuts(通過快捷連線進行恆等對映)

    We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

    我們對每幾個堆疊的層採用殘差學習。圖2中展示了一個構建塊(building block)。形式上,在本文中,我們將構建塊定義為:

    y = F(x, {W_i}) + x        (1)

 

      Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W_2 σ(W_1 x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).

     這裡x和y是所考慮的層的輸入和輸出向量。函式 F(x, {W_i}) 表示要學習的殘差對映。對於圖2中含有兩層的例子,F = W_2 σ(W_1 x),其中 σ 代表 ReLU [29],為了簡化符號省略了偏置項。F + x 操作由一個快捷連線和元素級(element-wise)的加法來實現。在加法之後我們再採用第二個非線性操作(即 σ(y),見圖2)。
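【譯者注】下面用NumPy給出式(1)的一個最小數值示意(權重隨機、維度與名稱均為假設),展示 F = W_2·σ(W_1·x) 與 y = F(x) + x 的計算順序:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def basic_block(x, W1, W2):
    """式(1)的示意:y = F(x, {W_i}) + x,其中 F = W2·σ(W1·x),偏置省略。"""
    residual = W2 @ relu(W1 @ x)   # F(x)
    return relu(residual + x)      # 相加之後再做第二個非線性 σ(y)

# 用法示例:輸入與輸出維度相同,才能做逐元素相加
x = np.random.randn(64)
W1 = 0.01 * np.random.randn(64, 64)
W2 = 0.01 * np.random.randn(64, 64)
y = basic_block(x, W1, W2)
```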

 

     The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

      Eq.1中的shortcut連線沒有增加額外的引數和計算複雜度。這不僅在實踐中很有吸引力,而且在我們比較普通網路和殘差網路時也很重要。 我們可以在引數、深度、寬度以及計算成本都相同的基礎上對兩個網路進行公平的比較(除了可以忽略不計的元素級的加法)。

  

    The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:

    在式(1)中,x和F的維數必須相等。如果情況並非如此(例如,在更改輸入/輸出通道時),我們可以通過快捷連線執行一個線性投影W_s,以匹配維度:

    y = F(x, {W_i}) + W_s x        (2)
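【譯者注】下面是式(2)中線性投影 W_s 的一種常見實現思路的示意(基於PyTorch,非論文原實現):用1×1卷積同時匹配通道數與空間尺寸;函式名以及是否附帶BN均為譯者的假設:

```python
import torch
import torch.nn as nn

def projection_shortcut(in_channels, out_channels, stride=2):
    """示意:用步長為stride的1x1卷積作為式(2)中的線性投影W_s,
    同時匹配通道數與特徵圖尺寸(附帶BN是常見做法,非論文明確要求)。"""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )

# 用法示例:把56x56x64的特徵圖投影為28x28x128,以便與F(x)相加
ws = projection_shortcut(64, 128, stride=2)
shortcut = ws(torch.randn(1, 64, 56, 56))
```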

  

      We can also use a square matrix W_s in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus W_s is only used when matching dimensions.

     我們還可以在式(1)中使用方陣W_s。但是我們將通過實驗證明,恆等對映足以解決退化問題,而且更經濟,因此W_s僅在匹配維數時使用。

 

     The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: y = W_1 x + x, for which we have not observed advantages.

     殘差函式F的形式是靈活的。本文的實驗涉及的函式F有兩層或三層(圖5),也可以有更多的層。但如果F只有單層,則式(1)類似於一個線性層:y = W_1 x + x,對此我們沒有觀察到任何優勢。

    

    We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x,{W i }) can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

    我們還注意到,雖然為了簡單起見,上述表示法是針對全連線層的,但它們同樣適用於卷積層。函式F(x, {W_i})可以表示多個卷積層。元素級加法在兩個特徵圖上逐通道執行。

 

3.3. Network Architectures(網路結構)

    We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

    我們測試了各種普通/殘差網路,並觀察到一致的現象。為了提供討論的例項,我們對ImageNet的兩個模型進行了如下描述。

    

    Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).

    Plain網路。我們的plain基線網路(圖3,中)主要受VGG網路[41](圖3,左)的理念啟發。卷積層主要為3×3的濾波器,並遵循兩條簡單的設計規則:(i) 對於相同的輸出特徵圖尺寸,各層具有相同數量的濾波器;(ii) 如果特徵圖尺寸減半,則濾波器數量加倍,以保持每層的時間複雜度。我們直接用步長為2的卷積層進行下采樣。網路以一個全域性平均池化層和一個帶有Softmax的1000路全連線層結束。在圖3(中)中,帶權值的層的總數為34。
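【譯者注】下面的PyTorch草圖僅用於說明上述兩條設計規則(同尺寸同濾波器數、尺寸減半則濾波器數加倍,並用步長2的卷積下采樣);函式名與引數均為示意:

```python
import torch.nn as nn

def make_plain_stage(in_channels, out_channels, num_layers, downsample):
    """示意:plain網路中的一個階段。若需要下采樣,第一個3x3卷積取stride=2,
    同時把濾波器數加倍,以保持每層的時間複雜度大致不變。"""
    layers = []
    channels = in_channels
    for i in range(num_layers):
        stride = 2 if (downsample and i == 0) else 1
        layers += [
            nn.Conv2d(channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        channels = out_channels
    return nn.Sequential(*layers)

# 用法示例:特徵圖尺寸減半,濾波器數從64加倍到128
stage = make_plain_stage(64, 128, num_layers=8, downsample=True)
```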

 

   It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

   值得注意的是,與VGG網路[41](圖3,左)相比,我們的模型具有更少的濾波器和更低的複雜度。我們的34層基線(baseline)有36億FLOPs(乘加運算),僅為VGG-19(196億FLOPs)的18%。

        圖3、對應於ImageNet的網路框架舉例。左:作為參考的VGG-19模型(196億個FLOPs)。中:plain網路,含有34個引數層(36億個FLOPs)。右:殘差網路,含有34個引數層(36億個FLOPs)。虛線表示的shortcuts增加了維度。表1展示了更多細節和其它變體。

 

   表1. 對應於ImageNet的結構框架。括號中為構建塊的引數(同樣見Fig.5),數個構建塊進行堆疊。下采樣由stride為2的conv3_1、conv4_1和conv5_1 來實現。

 

      Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

      殘差網路。基於上述plain網路,我們插入快捷連線(圖3,右),將網路轉換為對應的殘差版本。當輸入和輸出尺寸相同時(圖3中的實線快捷連線),可以直接使用恆等快捷連線(式(1))。當維度增加時(圖3中的虛線快捷連線),我們考慮兩個選項:(A) 快捷連線仍然執行恆等對映,在增加的維度上用零填充,這樣不會引入額外的引數;(B) 使用式(2)中的投影快捷連線來匹配維度(通過1×1卷積)。對於這兩個選項,當快捷連線跨越兩種尺寸的特徵圖時,均以步長2執行。
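【譯者注】下面給出選項A(零填充恆等快捷連線)的一個示意實現(PyTorch),說明其不引入任何引數;空間下采樣方式與函式名均為譯者假設:

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x, out_channels, stride=2):
    """示意:選項A。恆等快捷連線跨越尺寸變化時,在空間上按步長取樣,
    在增加的通道上用零填充,因此不引入任何引數。"""
    x = x[:, :, ::stride, ::stride]        # 空間下采樣,與stride=2的卷積對齊
    pad = out_channels - x.size(1)         # 需要補零的通道數
    return F.pad(x, (0, 0, 0, 0, 0, pad))  # 從最後一維(W)開始指定填充,僅在通道維末尾補零

# 用法示例:64通道、56x56 -> 128通道、28x28
y = option_a_shortcut(torch.randn(1, 64, 56, 56), out_channels=128)
```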

 

3.4. Implementation(實現)

   Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256,480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60×10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].

   我們對ImageNet的實現遵循了[21,41]中的做法。調整影象大小,使其短邊在[256,480]中隨機取樣,以進行尺度增強(scale augmentation)[41]。從影象或其水平翻轉中隨機裁剪224×224的區域,並減去每畫素均值(per-pixel mean)[21]。使用了[21]中的標準顏色增強。我們遵循[16],在每次卷積之後、啟用之前採用批歸一化(BN)[16]。我們按照[13]初始化權重,從零開始訓練所有plain/殘差網路。我們使用小批量大小為256的SGD。學習率從0.1開始,當誤差進入平臺期時除以10,模型總共訓練60×10^4次迭代。我們使用0.0001的權重衰減和0.9的動量。按照[16]的做法,我們不使用Dropout[14]。
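【譯者注】下面用PyTorch示意上述訓練超引數的設定方式(並非論文原始的Caffe配置);其中的佔位模型以及ReduceLROnPlateau的patience等引數是譯者的假設,論文只說明“誤差進入平臺期時將學習率除以10”:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # 佔位模型,僅作演示

# 論文描述的ImageNet超引數:SGD,mini-batch 256,初始學習率0.1,
# 權重衰減0.0001,動量0.9;誤差進入平臺期時學習率除以10。
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# patience=5為譯者假設,論文未給出具體的平臺期判定引數
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

# 每個評估週期結束後,用驗證誤差驅動學習率排程(示意):
# scheduler.step(val_error)
```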

   

   In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224,256,384,480,640}).

   在測試中,為了進行比較研究,我們採用標準的10-crop測試[21]。為了取得最好的結果,我們採用了[41,13]中的全卷積形式,並在多個尺度上對分數取平均(調整影象大小,使短邊取{224,256,384,480,640})。

 

4. Experiments(實驗)

4.1. ImageNet Classification(ImageNet分類)

    We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

    我們在ImageNet 2012分類資料集[36]上評估了我們的方法,該資料集包含1000個類別。模型在128萬張訓練影象上進行訓練,在5萬張驗證影象上進行評估。我們還獲得了測試伺服器報告的10萬張測試影象的最終結果。我們評估了top-1和top-5錯誤率。

 

  Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

   普通網路。我們首先評估18層和34層的普通網路。34層普通網路見圖3(中)。18層普通網路具有類似的形式。詳細的體系結構見表1。

 

   The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.

   表2的結果表明,較深的34層普通網路比較淺的18層普通網路具有更高的驗證誤差。為了揭示原因,在圖4(左)中,我們比較了它們在訓練過程中的訓練/驗證誤差。我們觀察到了退化問題:34層普通網路在整個訓練過程中都有更高的訓練誤差,儘管18層普通網路的解空間(solution space)是34層普通網路解空間的子空間。

  圖4、關於ImageNet的訓練。細曲線表示訓練誤差,粗體曲線表示中心crop的驗證誤差。左:18層和34層的普通網路。右:18層和34層殘差網路。在此圖中,與普通網路相比,殘差網路沒有(增加)額外的引數。

 

表2、ImageNet驗證集上的Top-1錯誤率(%,10-crop測試)。在這裡,與普通的對應網路相比,殘差網路沒有額外的引數。圖4展示了訓練過程。

 

   We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error【3--We have experimented with more training iterations (3×) and still observed the degradation problem, suggesting that this problem cannot be feasibly addressed by simply using more iterations.】. The reason for such optimization difficulties will be studied in the future.

   我們認為這種優化困難不太可能是由梯度消失引起的。這些plain網路使用BN[16]進行訓練,從而確保前向傳播的訊號具有非零方差。我們還驗證了反向傳播的梯度在BN下表現出健康的範數(norms)。所以無論前向還是反向訊號都沒有消失。事實上,34層普通網路仍然能夠達到有競爭力的精度(表3),這表明求解器在一定程度上是有效的。我們推測,深層普通網路可能具有指數級低的收斂速度,從而影響訓練誤差的降低【3--我們已經試驗了更多的訓練迭代(3×),仍然觀察到退化問題,這表明該問題無法通過簡單地使用更多迭代來解決】。這種優化困難的原因將在今後研究。

 

   Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.

   殘差網路。接下來,我們評估18層和34層殘差網路(ResNet)。基線結構與上述plain網路相同,只是在每對3×3濾波器上新增了一個快捷連線,如圖3(右)所示。在第一個比較中(表2和圖4右),我們對所有快捷連線使用恆等對映,並用零填充來增加維度(選項A)。因此,與對應的plain網路相比,它們沒有額外的引數。

 

   We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.

   我們從表2和圖4中得到三個主要的觀察結果。首先,使用殘差學習後情況發生了逆轉:34層ResNet優於18層ResNet(好2.8%)。更重要的是,34層ResNet表現出明顯更低的訓練誤差,並且可以推廣到驗證資料。這表明在這種情況下退化問題得到了很好的解決,並且我們設法從增加的深度中獲得了精度增益。

   

   Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.

   第二,與對應的plain網路相比,34層ResNet將top-1誤差降低了3.5%(表2),這得益於成功降低的訓練誤差(圖4右 vs 左)。這一比較驗證了殘差學習在極深系統上的有效性。

 

   Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.

   最後,我們還注意到,18層plain/殘差網路的精度相當(表2),但18層ResNet收斂得更快(圖4右 vs 左)。當網路“不太深”時(如這裡的18層),當前的SGD求解器仍然能夠為plain網路找到好的解。在這種情況下,ResNet通過在早期階段提供更快的收斂來簡化優化。

 

    Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter- free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

    恆等快捷連線 vs 投影快捷連線(Projection Shortcuts)。我們已經證明,無引數的恆等快捷連線有助於訓練。接下來我們研究投影快捷連線(式(2))。在表3中,我們比較了三個選項:(A) 零填充快捷連線用於增加維度,所有快捷連線都是無引數的(與表2和圖4右相同);(B) 投影快捷連線用於增加維度,其他快捷連線為恆等對映;(C) 所有快捷連線都是投影。

  表3、ImageNet驗證集上的錯誤率(%,10-crop測試)。VGG-16基於我們的測試。ResNet-50/101/152使用選項B,只對增加的維度使用投影快捷連線。

 

   Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.

   表3顯示,這三個選項都比對應的plain網路好得多。B比A稍好,我們認為這是因為A中零填充的維度實際上並沒有進行殘差學習。C比B略好,我們把這歸因於許多(13個)投影快捷連線引入的額外引數。但A/B/C之間的微小差異表明,投影快捷連線對於解決退化問題並不是必需的。因此,在本文的其餘部分中,我們不使用選項C,以降低記憶體/時間複雜度和模型大小。恆等快捷連線對於不增加下面介紹的瓶頸架構(bottleneck architectures)的複雜性尤其重要。

 

   Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design 4. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.

   更深層次的瓶頸架構。接下來,我們將描述我們針對ImageNet的更深層次的網路。由於考慮到我們負擔得起的訓練時間,我們將積木塊(building block)修改為瓶頸設計(bottleneck design)。對於每個殘差函式F,我們使用一個由3層組成的堆疊,而不是2層(圖5)。這三層分別是1×1、3×3和1×1卷積,其中1×1層負責減小然後增加(恢復)維數,使3×3層成為輸入/輸出維數較小的瓶頸。圖5給出了一個例子,其中兩種設計都具有相似的時間複雜度。

圖5、ImageNet的一個更深層次的殘差函式F。左:如圖3所示的用於ResNet-34的積木塊(作用在56×56的特徵圖上)。右:用於ResNet-50/101/152的“瓶頸”積木塊。
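【譯者注】下面是“瓶頸”構建塊的一個PyTorch示意(非原實現):1×1降維、3×3卷積、1×1恢復維度,再與恆等快捷連線相加;這裡假設輸入輸出通道數相同:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """示意:瓶頸構建塊。1x1降維、3x3卷積、1x1恢復維度,再與快捷連線相加。
    這裡假設輸入輸出通道數相同,因此可以使用恆等快捷連線。"""
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))    # 1x1 降維
        out = F.relu(self.bn2(self.conv2(out)))  # 3x3 瓶頸層
        out = self.bn3(self.conv3(out))          # 1x1 恢復維度
        return F.relu(out + x)                   # 恆等快捷連線

# 用法示例:類似ResNet-50中的一個塊(256 -> 64 -> 256)
block = Bottleneck(256, 64, 256)
y = block(torch.randn(1, 256, 56, 56))
```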

 

   The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

  無引數的恆等快捷連線對於瓶頸架構尤其重要。如果將圖5(右)中的恆等快捷連線替換為投影,可以看出時間複雜度和模型大小都會加倍,因為快捷連線連線到了兩個高維端點。因此,恆等快捷連線為瓶頸設計帶來了更高效的模型。

 

     50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.

     50層ResNet:我們將34層網路中的每個2層塊替換為這種3層瓶頸塊,從而得到一個50層ResNet(表1)。我們使用選項B來增加維度。該模型有38億FLOPs。

    

    101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).

    101層和152層ResNet:我們使用更多的3層塊來構造101層和152層的ResNet(表1)。值得注意的是,雖然深度顯著增加,但152層ResNet(113億FLOPs)的複雜度仍然低於VGG-16/19網路(153/196億FLOPs)。

 

   The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).

   50/101/152層ResNet比34層ResNet的準確率有相當大程度的提升(表3和表4)。我們沒有觀察到退化問題,因此可以從顯著增加的深度中獲得可觀的精度增益。所有評價指標都體現了深度帶來的好處(表3和表4)。

    表4、ImageNet驗證集上單模型結果的錯誤率(%)(除了†報告了測試集)。

 

       Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.

       與最先進的方法比較。在表4中,我們與之前最好的單模型結果進行比較。我們的基線34層ResNet已經達到了非常有競爭力的準確率。我們的152層ResNet的單模型top-5驗證誤差為4.49%。這個單模型結果優於之前所有的整合(ensemble)結果(表5)。我們將六個不同深度的模型組合成一個整合模型(提交時其中只有兩個152層的網路)。它在測試集上取得了3.57%的top-5錯誤率(表5)。該參賽結果獲得了ILSVRC 2015的第一名。

表5、組合的錯誤率(%)。在ImageNet的測試集上的top-5錯誤,並由測試伺服器報告。

 

4.2. CIFAR-10 and Analysis(CIFAR-10和分析)

     We conducted more studies on the CIFAR-10 dataset[20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.

     我們在CIFAR-10資料集[20]上進行了更多的研究,該資料集包括10個類別的5萬張訓練影象和1萬張測試影象。我們在訓練集上訓練,並在測試集上評估。我們關注的是極深網路的行為,而不是追求最先進的結果,因此我們有意使用如下的簡單架構。

 

     The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32,16,8} respectively, with 2n layers for each feature map size. The numbers of filters are {16,32,64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:

     普通/殘差結構遵循圖3(中/右)中的形式。網路輸入為32×32的影象,並減去每畫素均值。第一層為3×3卷積。然後,我們在尺寸分別為{32,16,8}的特徵圖上使用共6n層3×3卷積的堆疊,每種特徵圖尺寸各2n層。濾波器數量分別為{16,32,64}。下采樣通過步長為2的卷積來執行。網路以全域性平均池化、10路全連線層和softmax結束。共有6n+2個堆疊的帶權層。下表概述了該體系結構:

     輸出特徵圖尺寸:32×32、16×16、8×8
     層數:1+2n、2n、2n
     濾波器數量:16、32、64
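【譯者注】下面的PyTorch草圖按上述描述構造CIFAR-10上的6n+2層plain結構(殘差版本只需在每對3×3層上再加快捷連線);函式名與細節(如BN的位置)為譯者假設:

```python
import torch.nn as nn

def cifar_plain_net(n):
    """示意:CIFAR-10上的6n+2層plain結構。首層3x3卷積,之後在{32,16,8}
    三種特徵圖尺寸上各堆疊2n個3x3卷積層,濾波器數分別為{16,32,64},
    最後是全域性平均池化和10路全連線層。"""
    layers = [nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
              nn.BatchNorm2d(16), nn.ReLU(inplace=True)]
    channels = 16
    for stage, out_channels in enumerate([16, 32, 64]):
        for i in range(2 * n):
            stride = 2 if (stage > 0 and i == 0) else 1  # 進入新階段時下采樣
            layers += [nn.Conv2d(channels, out_channels, kernel_size=3,
                                 stride=stride, padding=1, bias=False),
                       nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True)]
            channels = out_channels
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)]
    return nn.Sequential(*layers)

# n = 3, 5, 7, 9 分別對應 20、32、44、56 層(只計卷積層與全連線層)
net = cifar_plain_net(3)
```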

     When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A),so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.

    當使用快捷連線時,它們被連線到成對的3×3層上(總共3n條快捷連線)。在這個資料集上,我們在所有情況下都使用恆等快捷連線(即選項A),因此我們的殘差模型與對應的plain模型具有完全相同的深度、寬度和引數數量。

 

    We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side,and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.

    我們使用0.0001的權重衰減和0.9的動量,採用[13]中的權值初始化和BN[16],但不使用Dropout。這些模型在兩塊GPU上以128的小批量大小進行訓練。我們以0.1的學習率開始,在第32k和48k次迭代時除以10,並在64k次迭代時終止訓練,這是在45k/5k的訓練/驗證劃分上確定的。我們按照[24]中的簡單資料增強進行訓練:每邊填充4個畫素,並從填充後的影象或其水平翻轉中隨機裁剪出32×32的區域。測試時,我們只評估原始32×32影象的單一檢視(single view)。
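【譯者注】下面用PyTorch/torchvision示意上述資料增強與學習率計劃(假設已安裝torchvision,且每次迭代呼叫一次scheduler.step();佔位模型僅為演示):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms  # 假設已安裝torchvision

# 資料增強(按論文描述):每邊填充4個畫素,隨機裁剪32x32,並隨機水平翻轉
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = nn.Linear(3 * 32 * 32, 10)  # 佔位模型,僅作演示

# 學習率計劃:初始0.1,在32k和48k次迭代時除以10,64k次迭代終止
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32000, 48000], gamma=0.1)
# 訓練迴圈中每個迭代依次呼叫 optimizer.step() 和 scheduler.step()
```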

    

    We compare n = {3,5,7,9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [42]), suggesting that such an optimization difficulty is a fundamental problem.

    我們比較n={3,5,7,9},從而得到20、32、44和56層的網路。圖6(左)顯示了plain網路的行為。深層plain網路隨著深度的增加而受到影響,越深則訓練誤差越大。這一現象與ImageNet(圖4,左)和MNIST(見[42])上的現象類似,表明這種優化困難是一個根本性的問題。

 

     Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.

     圖6(中間)顯示了殘差網的行為。類似於ImageNet的案例(圖4,右),當深度增加時,我們的殘差網也設法克服了優化的困難,並提高了精度。

 

     We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging 5. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [35] and Highway [42] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).

     我們進一步探索n=18,從而得到一個110層的ResNet。在這種情況下,我們發現0.1的初始學習率略大,以致無法開始收斂。因此,我們用0.01的學習率來熱身訓練,直到訓練誤差低於80%(約400次迭代),然後回到0.1繼續訓練。其餘的學習計劃與之前相同。這個110層網路收斂良好(圖6,中)。它的引數比FitNet[35]和Highway[42](表6)等其他深而瘦的網路更少,但其結果仍屬於最先進之列(6.43%,表6)。
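【譯者注】下面的簡單函式示意ResNet-110的學習率熱身策略(數值來自上文,函式名與呼叫方式為譯者假設):

```python
def resnet110_lr(iteration, train_error_below_80_percent):
    """示意:ResNet-110的學習率熱身策略。訓練誤差降到80%以下(約400次迭代)
    之前使用0.01;之後回到0.1,並按原計劃在32k/48k次迭代時除以10。"""
    if not train_error_below_80_percent:
        return 0.01           # 熱身階段
    if iteration < 32000:
        return 0.1
    if iteration < 48000:
        return 0.01
    return 0.001
```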

圖6、在CIFAR-10上的訓練。虛線表示訓練錯誤,粗體線表示測試錯誤。左:普通網路。普通-110的誤差大於60%,不顯示。中間:殘差網。右:有110層和1202層的殘差網。

表6、CIFAR-10測試集上的分類錯誤率。所有方法都使用了資料增強。對於ResNet-110,我們執行5次並展示“最好(平均值±標準差)”,如[43]中那樣。

    Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec. 3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.

    層響應分析。圖7顯示了層響應的標準差(std)。這些響應是每個3×3層在BN之後、其它非線性(ReLU/加法)之前的輸出。對於ResNet,該分析揭示了殘差函式的響應強度。圖7表明,ResNet的響應通常比對應的plain網路更小。這些結果支援了我們的基本動機(第3.1節),即殘差函式通常可能比非殘差函式更接近於零。我們還注意到,更深的ResNet具有更小的響應幅度,如圖7中ResNet-20、56和110之間的比較所示。當層數更多時,ResNet的單個層傾向於更少地修改訊號。
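【譯者注】下面給出一種統計層響應標準差的示意做法(PyTorch前向鉤子),對應上文“BN之後、非線性之前的輸出”;這裡簡化為掛在BatchNorm2d的輸出上,模型與輸入均為示意:

```python
import torch
import torch.nn as nn

def layer_response_stds(model, x):
    """示意:用前向鉤子收集各層響應(這裡取每個BatchNorm2d的輸出,
    即BN之後、非線性之前)的標準差。"""
    stds, hooks = [], []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out: stds.append(out.std().item())))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return stds

# 用法示例(模型與輸入均為示意)
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
print(layer_response_stds(model, torch.randn(1, 3, 32, 32)))
```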