
Computer Vision Study Notes - Implementing a Neural Network from Scratch - An Introduction


0 - Learning Objectives

  We will implement a simple 3-layer neural network. We will not carefully derive all of the required math, but we will give an intuitive explanation of what we are doing. Note that the code here is not tuned for the best possible results; you can adjust it further, or work through the exercises at the end, to improve it.

1 - Experiment Steps

1.1 - Import Packages

# Package imports
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)  # set the matplotlib figure size

1.2 - Generating a dataset

  Note that scikit-learn already contains code for generating datasets, so we do not need to write it ourselves; we simply call its make_moons method. In the plot below there are two classes of points: blue points represent male patients and red points represent female patients, while the x and y coordinates are medical measurements. Our goal is to train a model that separates male and female patients based on these measurements. Note that the boundary between the two classes is not a simple straight line, so a plain logistic regression model is unlikely to perform well.

# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
[Figure: scatter plot of the two-class moons dataset]

1.3 - Logistic Regression

  To prove this point, let's train a logistic regression classifier. Its input is the x and y coordinates and its output is the predicted class (0 or 1). We use the logistic regression implementation from scikit-learn directly.

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)
Out[3]:
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
# Helper function to plot a decision boundary.
# If you don't fully understand this function don't worry, it just generates the contour plot below.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")
[Figure: logistic regression decision boundary]


  As you can see, logistic regression separates the two classes as well as it can with a straight line, but because the data is not linearly separable to begin with, the result is not very good.

1.4 - Training a Neural Network

  Now let's build a simple 3-layer neural network with one input layer, one hidden layer, and one output layer, and use it to make predictions.

1.4.1 - How our network makes predictions

  The network makes predictions using the following formulas:

$$
\begin{aligned}
z_1 & = xW_1 + b_1 \\
a_1 & = \tanh(z_1) \\
z_2 & = a_1W_2 + b_2 \\
a_2 & = \hat{y} = \mathrm{softmax}(z_2)
\end{aligned}
$$

1.4.2 - Learning the Parameters

  Learning the parameters means finding a set of parameters ($W_1, b_1, W_2, b_2$) that minimizes the loss on the training set. Now we need to define the loss function; here we use the commonly used cross-entropy loss, as follows:

$$
\begin{aligned}
L(y,\hat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\hat{y}_{n,i}
\end{aligned}
$$

  We then use gradient descent to minimize the loss. We will implement the simplest version of gradient descent: batch gradient descent with a fixed learning rate. In practice, variants such as SGD (stochastic gradient descent) or minibatch gradient descent usually perform better, so these are good starting points for later improvements.

  Gradient descent needs the gradients of the loss function with respect to the parameters we are updating: $\frac{\partial{L}}{\partial{W_1}}$, $\frac{\partial{L}}{\partial{b_1}}$, $\frac{\partial{L}}{\partial{W_2}}$, $\frac{\partial{L}}{\partial{b_2}}$. To compute these gradients we use the famous backpropagation algorithm, which computes the gradients efficiently, starting from the output. We will not go into the details of how backpropagation works here, but simply state the formulas it gives us:

$$
\begin{aligned}
& \delta_3 = \hat{y} - y \\
& \delta_2 = (1 - \tanh^2z_1) \circ \delta_3W_2^T \\
& \frac{\partial{L}}{\partial{W_2}} = a_1^T \delta_3 \\
& \frac{\partial{L}}{\partial{b_2}} = \delta_3\\
& \frac{\partial{L}}{\partial{W_1}} = x^T \delta_2\\
& \frac{\partial{L}}{\partial{b_1}} = \delta_2 \\
\end{aligned}
$$

1.4.3 - Implementation

  Let's start implementing!

  Variable definitions.

num_examples = len(X) # size of the training set
nn_input_dim = 2 # dimensionality of the input layer
nn_output_dim = 2 # dimensionality of the output layer

# Gradient descent parameters (I picked these by hand)
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

  Loss function definition. Note that in addition to the cross-entropy term above, the code also adds an L2 regularization term on the weights.

# Helper function to evaluate the total loss on the dataset
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the cross-entropy loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add the L2 regularization term to the loss
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

  We also implement a helper function to compute the output of the network. It performs forward propagation and returns the class with the highest probability.

# Helper function to predict an output (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

  Finally, we use batch gradient descent to train our neural network.

# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}
    
    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        
        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
          print("Loss after iteration %i: %f" %(i, calculate_loss(model)))
    
    return model

1.4.4 - A network with a hidden layer of size 3

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
Loss after iteration 0: 0.432387
Loss after iteration 1000: 0.068947
Loss after iteration 2000: 0.068926
Loss after iteration 3000: 0.071218
Loss after iteration 4000: 0.071253
Loss after iteration 5000: 0.071278
Loss after iteration 6000: 0.071293
Loss after iteration 7000: 0.071303
Loss after iteration 8000: 0.071308
Loss after iteration 9000: 0.071312
Loss after iteration 10000: 0.071314
Loss after iteration 11000: 0.071315
Loss after iteration 12000: 0.071315
Loss after iteration 13000: 0.071316
Loss after iteration 14000: 0.071316
Loss after iteration 15000: 0.071316
Loss after iteration 16000: 0.071316
Loss after iteration 17000: 0.071316
Loss after iteration 18000: 0.071316
Loss after iteration 19000: 0.071316
[Figure: decision boundary of the neural network with hidden layer size 3]

  This looks much better than what logistic regression achieved!

1.5 - Varying the hidden layer size

  In the example above we picked a hidden layer of size 3. Let's now get a sense of how varying the hidden layer size affects the result.

plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()
[Figure: decision boundaries for hidden layer sizes 1, 2, 3, 4, 5, 20, 50]

2 - Exercises

  Some exercises for further practice are given below.

  • Instead of batch gradient descent, use minibatch gradient descent (more info) to train the network. Minibatch gradient descent typically performs better in practice.
  • We used a fixed learning rate $\epsilon$ for gradient descent. Implement an annealing schedule for the gradient descent learning rate (more info).
  • We used a $\tanh$ activation function for our hidden layer. Experiment with other activation functions (some are mentioned above). Note that changing the activation function also means changing the backpropagation derivative.
  • Extend the network from two to three classes. You will need to generate an appropriate dataset for this.
  • Extend the network to four layers. Experiment with the layer size. Adding another hidden layer means you will need to adjust both the forward propagation as well as the backpropagation code.

3 - Exercises (1)
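
  Exercise (1) asks us to train the network with minibatch gradient descent instead of batch gradient descent. Below is a minimal sketch of how the training loop in build_model could be modified; the batch_size parameter and the use of np.random.choice to sample a random minibatch on every iteration are my own choices, not part of the original tutorial. The sketch reuses X, y, nn_input_dim, nn_output_dim, epsilon, reg_lambda and calculate_loss defined earlier in this post.

import numpy as np

def build_model_minibatch(nn_hdim, num_passes=20000, batch_size=32, print_loss=False):
    # Initialize the parameters to random values, exactly as in build_model
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    model = {}

    for i in range(0, num_passes):
        # Sample a random minibatch instead of using the whole training set
        batch_idx = np.random.choice(len(X), batch_size, replace=False)
        X_batch, y_batch = X[batch_idx], y[batch_idx]

        # Forward propagation on the minibatch
        z1 = X_batch.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation on the minibatch
        delta3 = probs
        delta3[range(batch_size), y_batch] -= 1
        dW2 = (a1.T).dot(delta3) + reg_lambda * W2
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X_batch.T, delta2) + reg_lambda * W1
        db1 = np.sum(delta2, axis=0)

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        if print_loss and i % 1000 == 0:
          print("Loss after iteration %i: %f" %(i, calculate_loss(model)))

    return model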

4 - Exercises (2)

  Update the learning rate with an annealing schedule, using the formula $\epsilon_t = \frac{\epsilon_0}{1 + d \cdot t}$, where $\epsilon_0$ is the initial learning rate, $d$ is the decay rate, and $t$ is the iteration number.

# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
# - d: the decay number of annealing schedule
def build_model(nn_hdim, num_passes=20000, print_loss=False, d=10e-3):
    
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}
    
    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        epsilon_ = epsilon / (1 + d * i)  # annealed learning rate
        # Gradient descent parameter update
        W1 += -epsilon_ * dW1
        b1 += -epsilon_ * db1
        W2 += -epsilon_ * dW2
        b2 += -epsilon_ * db2
        
        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
          print("Loss after iteration %i: %f" %(i, calculate_loss(model)))
    
    return model
Loss after iteration 0: 0.432387
Loss after iteration 1000: 0.081007
Loss after iteration 2000: 0.075384
Loss after iteration 3000: 0.073729
Loss after iteration 4000: 0.072895
Loss after iteration 5000: 0.072376
Loss after iteration 6000: 0.072013
Loss after iteration 7000: 0.071742
Loss after iteration 8000: 0.071530
Loss after iteration 9000: 0.071357
Loss after iteration 10000: 0.071214
Loss after iteration 11000: 0.071092
Loss after iteration 12000: 0.070986
Loss after iteration 13000: 0.070894
Loss after iteration 14000: 0.070812
Loss after iteration 15000: 0.070739
Loss after iteration 16000: 0.070673
Loss after iteration 17000: 0.070613
Loss after iteration 18000: 0.070559
Loss after iteration 19000: 0.070509

5 - Exercises (3)
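
  Exercise (3) asks us to try a different activation function for the hidden layer, which also changes the backpropagation derivative. As a minimal sketch, the hypothetical helper below shows the forward and backward pass with ReLU instead of tanh; only the two commented lines differ from the corresponding code in build_model, and the rest of the training loop stays the same.

import numpy as np

def forward_backward_relu(Xb, yb, W1, b1, W2, b2):
    # Forward propagation with ReLU in the hidden layer
    z1 = Xb.dot(W1) + b1
    a1 = np.maximum(0, z1)                  # was: a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # Backpropagation
    delta3 = probs
    delta3[range(len(Xb)), yb] -= 1
    dW2 = (a1.T).dot(delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)
    # The ReLU derivative is 1 where z1 > 0 and 0 elsewhere
    delta2 = delta3.dot(W2.T) * (z1 > 0)    # was: * (1 - np.power(a1, 2))
    dW1 = np.dot(Xb.T, delta2)
    db1 = np.sum(delta2, axis=0)
    return dW1, db1, dW2, db2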

6 - Exercises (4)
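
  Exercise (4) asks us to extend the network from two to three classes, which requires an appropriate dataset. Since the softmax output and the cross-entropy loss already handle any number of classes, only the dataset and nn_output_dim need to change. The choice of make_blobs as the generator below is my own; any three-class 2-D dataset would do.

import numpy as np
import sklearn.datasets

# Generate a three-class 2-D dataset
np.random.seed(0)
X, y = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=1.0)

num_examples = len(X)  # size of the training set
nn_input_dim = 2       # still two input features
nn_output_dim = 3      # one output unit per class

# build_model and predict from section 1.4.3 can then be reused unchanged,
# because they only depend on nn_input_dim and nn_output_dim.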

7 - Exercises (5)
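
  Exercise (5) asks us to extend the network to four layers, which means adjusting both the forward propagation and the backpropagation code. Below is a minimal sketch of a network with two tanh hidden layers of sizes nn_hdim1 and nn_hdim2 (both hypothetical parameters of my own). It reuses X, y, num_examples, nn_input_dim, nn_output_dim, epsilon and reg_lambda from earlier; loss printing is omitted because calculate_loss would also need the extra layer.

import numpy as np

def build_model_4layer(nn_hdim1, nn_hdim2, num_passes=20000):
    # Initialize parameters for three weight matrices instead of two
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim1) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim1))
    W2 = np.random.randn(nn_hdim1, nn_hdim2) / np.sqrt(nn_hdim1)
    b2 = np.zeros((1, nn_hdim2))
    W3 = np.random.randn(nn_hdim2, nn_output_dim) / np.sqrt(nn_hdim2)
    b3 = np.zeros((1, nn_output_dim))

    for i in range(0, num_passes):
        # Forward propagation through both hidden layers
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        a2 = np.tanh(z2)
        z3 = a2.dot(W3) + b3
        exp_scores = np.exp(z3)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation: one extra delta for the extra hidden layer
        delta4 = probs
        delta4[range(num_examples), y] -= 1
        dW3 = (a2.T).dot(delta4) + reg_lambda * W3
        db3 = np.sum(delta4, axis=0, keepdims=True)
        delta3 = delta4.dot(W3.T) * (1 - np.power(a2, 2))
        dW2 = (a1.T).dot(delta3) + reg_lambda * W2
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2) + reg_lambda * W1
        db1 = np.sum(delta2, axis=0)

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        W3 += -epsilon * dW3
        b3 += -epsilon * db3

    return {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3}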

8 - References

http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/

https://github.com/dennybritz/nn-from-scratch
