
[Neural Networks and Deep Learning - Case Study] Chapter 4: How Neural Networks Classify Digits


Case study: Using a neural network to recognize handwritten digits

All right, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user, you can obtain the data by cloning the code repository for this book:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images and 10,000 test images. That's the official MNIST description. We're actually going to split the data a little differently: we'll leave the 10,000 test images as they are, but divide the 60,000-image training set into a 50,000-image set used to train the network and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book it will prove useful for figuring out how to set certain hyper-parameters of the network, such as the learning rate and so on. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. From now on, when I refer to the "MNIST training data", I mean this 50,000-image set.

As mentioned earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST, the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).
Besides the MNIST data, we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Before giving a full listing, let me explain the core features of the neural network code. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
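For example, for the net above the initializer creates one weight matrix and one bias vector for each layer after the input layer. The following quick shape check is a hypothetical illustration, assuming numpy has been imported as np:

net = Network([2, 3, 1])
print(len(net.biases))        # 2 -- no biases are stored for the input layer
print(net.biases[0].shape)    # (3, 1): one bias per neuron in the 3-neuron layer
print(net.weights[0].shape)   # (3, 2): weights from the 2-neuron layer to the 3-neuron layer
print(net.weights[1].shape)   # (1, 3): weights from the 3-neuron layer to the single output neuron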

The activations of each layer of neurons are obtained from those of the previous layer by the equation

$a' = \sigma(wa + b)$

There's quite a bit going on in this equation, so let's unpack it. $a$ is the vector of activations of one layer of neurons. To obtain the next layer's activations $a'$, we multiply $a$ by the weight matrix $w$ and add the vector of biases $b$. We then apply the function $\sigma$ elementwise to every entry of the vector $wa + b$. (This is called vectorizing the function $\sigma$.) For instance, if $w$ has shape (3, 2), $a$ has shape (2, 1) and $b$ has shape (3, 1), then $wa + b$ has shape (3, 1), and $\sigma$ is applied to each of its three entries.

With this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function, $\sigma(z) = \frac{1}{1+e^{-z}}$:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
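Note that when z is a vector or Numpy array, Numpy automatically applies the sigmoid elementwise, i.e. in vectorized form. A small hypothetical check (again assuming numpy is available as np):

z = np.array([[-1.0], [0.0], [1.0]])
print(sigmoid(z))    # roughly [[0.269], [0.5], [0.731]] -- one sigmoid value per entry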

We then add a feedforward method to the Network class which, given an input a for the network, returns the corresponding output, applying $a' = \sigma(wa + b)$ layer by layer:

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
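As a quick sanity check (a hypothetical snippet, still assuming numpy as np), feeding a suitably shaped column vector through a freshly created network returns the output activations:

net = Network([2, 3, 1])
x = np.random.randn(2, 1)     # a dummy input for the 2-neuron input layer
print(net.feedforward(x))     # a (1, 1) array; the single output activation lies in (0, 1)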

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and the corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate. If the optional argument test_data is supplied, then the program will evaluate the network against the test data after each epoch of training and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the line self.update_mini_batch(mini_batch, eta), which updates the network's weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw 
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb 
                   for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

    delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
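Concretely, the last two lines of update_mini_batch apply the averaged gradient-descent update (writing $m$ for len(mini_batch) and $\eta$ for the learning rate eta):

$w \rightarrow w - \frac{\eta}{m} \sum_{x} \nabla_w C_x, \qquad b \rightarrow b - \frac{\eta}{m} \sum_{x} \nabla_b C_x$

where the sums run over the training examples $x$ in the mini-batch.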

I'm not going to show the self.backprop code just now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example x.

Let's look at the full program, including the documentation strings which I omitted above. Apart from self.backprop the program is self-explanatory: all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed.

Note that while the program appears lengthy, much of the code is documentation strings intended to make it easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub.

  ""
    network.py


A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

 #### Libraries
 # Standard library
  import random

 # Third-party libraries
  import numpy as np

 class Network(object):

def __init__(self, sizes):
    """The list "sizes" contains the number of neurons in the
    respective layers of the network.  For example, if the list
    was [2, 3, 1] then it would be a three-layer network, with the
    first layer containing 2 neurons, the second layer 3 neurons,
    and the third layer 1 neuron.  The biases and weights for the
    network are initialized randomly, using a Gaussian
    distribution with mean 0, and variance 1.  Note that the first
    layer is assumed to be an input layer, and by convention we
    won't set any biases for those neurons, since biases are only
    ever used in computing the outputs from later layers."""

    self.num_layers = len(sizes)
    self.sizes = sizes
    self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
    self.weights = [np.random.randn(y, x)
                    for x, y in zip(sizes[:-1], sizes[1:])]

def feedforward(self, a):
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):

    """Train the neural network using mini-batch stochastic
    gradient descent.  The ``training_data`` is a list of tuples
    ``(x, y)`` representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If ``test_data`` is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""

    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)

def update_mini_batch(self, mini_batch, eta):

    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""

    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x.  ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""

    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book.  Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on.  It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in xrange(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return (nabla_b, nabla_w)

def evaluate(self, test_data):
    """Return the number of test inputs for which the neural
    network outputs the correct result. Note that the neural
    network's output is assumed to be the index of whichever
    neuron in the final layer has the highest activation.""" 

    test_results = [(np.argmax(self.feedforward(x)), y)
                    for (x, y) in test_data]
    return sum(int(x == y) for (x, y) in test_results)

def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""

    return (output_activations-y)

  #### Miscellaneous functions
   def sigmoid(z):
     """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

  def sigmoid_prime(z):
   """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings: it's simple stuff, tuples and lists of Numpy ndarray objects.
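Before the loader listing, here is how it is typically put together with network.py in a Python shell. This usage sketch assumes the repository's mnist_loader module and its load_data_wrapper function are on your path; the hyper-parameters (30 hidden neurons, 30 epochs, mini-batches of 10, learning rate 3.0) are just one reasonable choice:

>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])   # 784 input pixels, 30 hidden neurons, 10 output classes
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)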