Neural Network Optimization Algorithms, Part 1 (Gradient Descent and Learning Rate Settings)
1. Gradient Descent
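Gradient descent is mainly used to optimize the value of a single parameter, while backpropagation provides an efficient way to apply gradient descent to all parameters at once, so that the loss of the neural network on the training data becomes as small as possible. Backpropagation is the core algorithm for training neural networks: given a defined loss function, it adjusts the network's parameters so that the loss on the training set reaches a small value.
Let θ denote the parameters of the neural network and J(θ) the loss on the training set for a given parameter value. The whole optimization process can then be abstracted as finding a θ that minimizes J(θ). No general method exists for directly solving for the optimal parameters of an arbitrary loss function, so in practice gradient descent is the most commonly used optimization method for neural networks. The figure illustrates how gradient descent works.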
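The x-axis shows the value of the parameter θ and the y-axis shows the loss J(θ). Suppose the current parameter and loss correspond to the small dot in the figure. Gradient descent then moves the parameter toward the left along the x-axis, so the dot moves in the direction of the arrow. With learning rate η, the parameter update rule is:
$$\theta_{n+1} = \theta_n - \eta \frac{\partial}{\partial \theta_n} J(\theta_n)$$
Here the partial derivative is the gradient of the loss with respect to the parameter at the dot, and the parameter always moves along the negative gradient direction.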
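To make the update rule concrete, here is a minimal plain-Python sketch. The function name `gradient_descent` and the toy loss J(θ) = θ² (whose gradient is 2θ) are assumptions chosen only for illustration; they are not part of the original example.

```python
def gradient_descent(grad_fn, theta0, eta, steps):
    """Repeatedly apply theta <- theta - eta * dJ/dtheta."""
    theta = theta0
    for _ in range(steps):
        # Move against the gradient, scaled by the learning rate eta.
        theta = theta - eta * grad_fn(theta)
    return theta


# Hypothetical toy loss J(theta) = theta^2, whose gradient is 2 * theta.
theta_final = gradient_descent(lambda t: 2.0 * t, theta0=3.0, eta=0.1, steps=50)
print("theta after 50 steps: %g" % theta_final)  # very close to the minimizer 0
```

Each step shrinks θ by the same factor here because the gradient is linear in θ; for a general loss the step size varies with the local slope.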
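Note: gradient descent does not guarantee that the optimized function reaches the global optimum, as the following figure shows:
At the small black dot, the partial derivative is zero or close to zero, so the parameter is no longer updated. If the initial value of x falls in the dark region on the right, gradient descent ends up at the local optimum marked by the small black dot.
So when training a neural network, the initial parameter values strongly affect the final result. Gradient descent is guaranteed to reach the global optimum only when the loss function is convex.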
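The dependence on the starting point can be illustrated with a small sketch. The non-convex function f(x) = x⁴ − 3x² + x, the learning rate 0.01, and the two starting points are all assumptions picked for illustration. The function has two local minima, and plain gradient descent settles into whichever one lies downhill from the initial value.

```python
def f(x):
    # Hypothetical non-convex loss with two local minima.
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    # Derivative of f.
    return 4 * x ** 3 - 6 * x + 1


def descend(x, eta=0.01, steps=1000):
    """Plain gradient descent starting from x."""
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x


x_left = descend(-2.0)   # starts left of the hump
x_right = descend(2.0)   # starts right of the hump
print("start -2.0 -> x = %.4f, f(x) = %.4f" % (x_left, f(x_left)))
print("start  2.0 -> x = %.4f, f(x) = %.4f" % (x_right, f(x_right)))
```

The run that starts on the right stops at a minimum with a higher loss than the one reached from the left, even though both runs use the same algorithm and learning rate.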
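Besides not always reaching the global optimum, gradient descent has another problem: it is computationally expensive. Because the goal is to minimize the loss over the entire training set, the loss J(θ) is the sum of the losses over all training examples, and every update step has to process all of them.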
Example: take loss = (w+1)^2 as the loss function and minimize it with gradient descent.
The code is as follows:
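# coding: utf-8
# Loss function: loss = (w + 1)^2, with w initialized to the constant 5.
# Backpropagation searches for the optimal w, i.e. the w that minimizes loss.
import tensorflow as tf  # TensorFlow 1.x API

# Parameter to optimize, initial value 5
w = tf.Variable(tf.constant(5, dtype=tf.float32))
# Loss function; TensorFlow also accepts the arithmetic form (w + 1) ** 2
loss = tf.square(w + 1)
# Backpropagation method: gradient descent with learning rate 0.2,
# the value that matches the run output shown below
train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)

# Create a session and train for 40 steps
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    print("Before training: w is %f, loss is %f." % (sess.run(w), sess.run(loss)))
    for i in range(40):
        sess.run(train_step)
        w_val = sess.run(w)
        loss_val = sess.run(loss)
        print("After %d steps: w is %f, loss is %f." % (i, w_val, loss_val))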
Output:
(tf1.5) ~/tf/tf4$ python loss.py
2018-12-11 16:16:52.387120: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
After 0 steps: w is 2.600000, loss is 12.959999.
After 1 steps: w is 1.160000, loss is 4.665599.
After 2 steps: w is 0.296000, loss is 1.679616.
After 3 steps: w is -0.222400, loss is 0.604662.
After 4 steps: w is -0.533440, loss is 0.217678.
After 5 steps: w is -0.720064, loss is 0.078364.
After 6 steps: w is -0.832038, loss is 0.028211.
After 7 steps: w is -0.899223, loss is 0.010156.
After 8 steps: w is -0.939534, loss is 0.003656.
After 9 steps: w is -0.963720, loss is 0.001316.
After 10 steps: w is -0.978232, loss is 0.000474.
After 11 steps: w is -0.986939, loss is 0.000171.
After 12 steps: w is -0.992164, loss is 0.000061.
After 13 steps: w is -0.995298, loss is 0.000022.
After 14 steps: w is -0.997179, loss is 0.000008.
After 15 steps: w is -0.998307, loss is 0.000003.
After 16 steps: w is -0.998984, loss is 0.000001.
After 17 steps: w is -0.999391, loss is 0.000000.
After 18 steps: w is -0.999634, loss is 0.000000.
After 19 steps: w is -0.999781, loss is 0.000000.
After 20 steps: w is -0.999868, loss is 0.000000.
After 21 steps: w is -0.999921, loss is 0.000000.
After 22 steps: w is -0.999953, loss is 0.000000.
After 23 steps: w is -0.999972, loss is 0.000000.
After 24 steps: w is -0.999983, loss is 0.000000.
After 25 steps: w is -0.999990, loss is 0.000000.
After 26 steps: w is -0.999994, loss is 0.000000.
After 27 steps: w is -0.999996, loss is 0.000000.
After 28 steps: w is -0.999998, loss is 0.000000.
After 29 steps: w is -0.999999, loss is 0.000000.
After 30 steps: w is -0.999999, loss is 0.000000.
After 31 steps: w is -1.000000, loss is 0.000000.
After 32 steps: w is -1.000000, loss is 0.000000.
After 33 steps: w is -1.000000, loss is 0.000000.
After 34 steps: w is -1.000000, loss is 0.000000.
After 35 steps: w is -1.000000, loss is 0.000000.
After 36 steps: w is -1.000000, loss is 0.000000.
After 37 steps: w is -1.000000, loss is 0.000000.
After 38 steps: w is -1.000000, loss is 0.000000.
After 39 steps: w is -1.000000, loss is 0.000000.
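The first few printed values can be checked by hand without TensorFlow. The sketch below applies w ← w − η · 2(w + 1) directly. The step size of 0.2 is inferred from the printed output (it is the value that takes w from 5 to 2.6 in one step), and the variable names are hypothetical.

```python
w = 5.0
eta = 0.2  # assumption: the step size that reproduces the printed values

history = []
for i in range(3):
    grad = 2.0 * (w + 1.0)       # derivative of (w + 1)^2
    w = w - eta * grad           # gradient descent update
    loss = (w + 1.0) ** 2
    history.append((w, loss))
    print("After %d steps: w is %f, loss is %f." % (i, w, loss))
```

Because the gradient is linear in w, the distance from the minimizer w = −1 is multiplied by 1 − 2η = 0.6 at every step, which is why the loss shrinks geometrically.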
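2. Further Neural Network Optimization: Learning Rate Settings
The previous section showed how the learning rate is used in gradient descent: it determines how far the parameters move at each update. A learning rate that is too large makes the parameters oscillate around the optimum, while one that is too small makes training take too long. To address this, TensorFlow provides a more flexible way to set the learning rate: exponential decay.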
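Both failure modes are easy to see on the loss (w + 1)^2 with w starting at 5. In the sketch below, the learning rates 1.0, 0.0001, and 0.2 and the helper name `train` are assumptions chosen for illustration.

```python
def train(eta, steps=40, w=5.0):
    """Run gradient descent on loss = (w + 1)^2 and return the final w."""
    for _ in range(steps):
        w = w - eta * 2.0 * (w + 1.0)
    return w


w_large = train(1.0)       # too large: w jumps between 5 and -7 forever
w_small = train(0.0001)    # too small: w has barely moved after 40 steps
w_moderate = train(0.2)    # moderate: w is essentially at the minimizer -1

print("eta=1.0    -> w = %f" % w_large)
print("eta=0.0001 -> w = %f" % w_small)
print("eta=0.2    -> w = %f" % w_moderate)
```

With η = 1.0 each update overshoots the minimizer by exactly the distance it started from, so the loss never decreases. With η = 0.0001 the loss does decrease, but far too slowly to be useful in 40 steps.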
The tf.train.exponential_decay function implements an exponentially decaying learning rate. It computes:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
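Here decayed_learning_rate is the learning rate used at each optimization step, learning_rate is the initial learning rate, decay_rate is the decay coefficient, and global_step records the current number of training steps and is a non-trainable variable. decay_steps controls the decay speed: the learning rate is multiplied by decay_rate once every decay_steps steps. With staircase=True, global_step / decay_steps is truncated to an integer, so the learning rate changes only once every decay_steps steps and stays constant in between. The figure shows the learning rate gradually decreasing as the number of iterations grows. decay_steps is usually set to the number of iterations needed for one pass over the training set, that is, the total number of training samples divided by the number of samples in each batch.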
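The formula can be sketched in plain Python as follows. The helper name `decayed_rate` is hypothetical; it mirrors the formula above and the documented staircase behavior, and is not TensorFlow's own implementation.

```python
def decayed_rate(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        # Integer division: the exponent only changes every decay_steps steps.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / float(decay_steps)
    return learning_rate * decay_rate ** exponent


# Halfway through a decay period of 100 steps:
smooth = decayed_rate(0.1, 50, 100, 0.99)                  # already slightly decayed
stepped = decayed_rate(0.1, 50, 100, 0.99, staircase=True)  # still the initial rate
print("smooth: %f, staircase: %f" % (smooth, stepped))
```

With staircase=True the rate stays at its previous value until a full decay period has passed, which keeps every batch within one pass over the data on the same learning rate.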
The experiment code is as follows:
# coding: utf-8
# Loss function: loss = (w + 1)^2, with w initialized to the constant 5.
# Backpropagation searches for the optimal w, i.e. the w that minimizes loss.
# An exponentially decaying learning rate gives fast descent in the early
# iterations, so training converges well within fewer steps.
import tensorflow as tf  # TensorFlow 1.x API

LEARNING_RATE_BASE = 0.1    # initial learning rate
LEARNING_RATE_DECAY = 0.99  # learning rate decay rate
LEARNING_RATE_STEP = 1      # number of batches fed between learning rate updates; usually total samples / BATCH_SIZE

# Counter of batches run so far; starts at 0 and is marked non-trainable
global_step = tf.Variable(0, trainable=False)
# Exponentially decaying learning rate generated by exponential_decay, starting at 0.1
learning_rate = tf.train.exponential_decay(LEARNING_RATE_BASE, global_step, LEARNING_RATE_STEP, LEARNING_RATE_DECAY, staircase=True)
# Parameter to optimize, initial value 5
w = tf.Variable(tf.constant(5, dtype=tf.float32))
# Loss function
loss = tf.square(w + 1)
# Backpropagation method; passing global_step makes minimize() increment it on every step
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

# Create a session and train for 40 steps
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    for i in range(40):
        sess.run(train_step)
        learning_rate_val = sess.run(learning_rate)
        global_step_val = sess.run(global_step)
        w_val = sess.run(w)
        loss_val = sess.run(loss)
        print("After %s steps: global_step is %f, w is %f, learning rate is %f, loss is %f" % (i, global_step_val, w_val, learning_rate_val, loss_val))
Output:
(tf1.5) ~/tf/tf4$ python opt4_5.py
2018-12-11 16:02:25.500643: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
After 0 steps: global_step is 1.000000, w is 3.800000, learning rate is 0.099000, loss is 23.040001
After 1 steps: global_step is 2.000000, w is 2.849600, learning rate is 0.098010, loss is 14.819419
After 2 steps: global_step is 3.000000, w is 2.095001, learning rate is 0.097030, loss is 9.579033
After 3 steps: global_step is 4.000000, w is 1.494386, learning rate is 0.096060, loss is 6.221961
After 4 steps: global_step is 5.000000, w is 1.015167, learning rate is 0.095099, loss is 4.060896
After 5 steps: global_step is 6.000000, w is 0.631886, learning rate is 0.094148, loss is 2.663051
After 6 steps: global_step is 7.000000, w is 0.324608, learning rate is 0.093207, loss is 1.754587
After 7 steps: global_step is 8.000000, w is 0.077684, learning rate is 0.092274, loss is 1.161403
After 8 steps: global_step is 9.000000, w is -0.121202, learning rate is 0.091352, loss is 0.772287
After 9 steps: global_step is 10.000000, w is -0.281761, learning rate is 0.090438, loss is 0.515867
After 10 steps: global_step is 11.000000, w is -0.411674, learning rate is 0.089534, loss is 0.346128
After 11 steps: global_step is 12.000000, w is -0.517024, learning rate is 0.088638, loss is 0.233266
After 12 steps: global_step is 13.000000, w is -0.602644, learning rate is 0.087752, loss is 0.157891
After 13 steps: global_step is 14.000000, w is -0.672382, learning rate is 0.086875, loss is 0.107334
After 14 steps: global_step is 15.000000, w is -0.729305, learning rate is 0.086006, loss is 0.073276
After 15 steps: global_step is 16.000000, w is -0.775868, learning rate is 0.085146, loss is 0.050235
After 16 steps: global_step is 17.000000, w is -0.814036, learning rate is 0.084294, loss is 0.034583
After 17 steps: global_step is 18.000000, w is -0.845387, learning rate is 0.083451, loss is 0.023905
After 18 steps: global_step is 19.000000, w is -0.871193, learning rate is 0.082617, loss is 0.016591
After 19 steps: global_step is 20.000000, w is -0.892476, learning rate is 0.081791, loss is 0.011561
After 20 steps: global_step is 21.000000, w is -0.910065, learning rate is 0.080973, loss is 0.008088
After 21 steps: global_step is 22.000000, w is -0.924629, learning rate is 0.080163, loss is 0.005681
After 22 steps: global_step is 23.000000, w is -0.936713, learning rate is 0.079361, loss is 0.004005
After 23 steps: global_step is 24.000000, w is -0.946758, learning rate is 0.078568, loss is 0.002835
After 24 steps: global_step is 25.000000, w is -0.955125, learning rate is 0.077782, loss is 0.002014
After 25 steps: global_step is 26.000000, w is -0.962106, learning rate is 0.077004, loss is 0.001436
After 26 steps: global_step is 27.000000, w is -0.967942, learning rate is 0.076234, loss is 0.001028
After 27 steps: global_step is 28.000000, w is -0.972830, learning rate is 0.075472, loss is 0.000738
After 28 steps: global_step is 29.000000, w is -0.976931, learning rate is 0.074717, loss is 0.000532
After 29 steps: global_step is 30.000000, w is -0.980378, learning rate is 0.073970, loss is 0.000385
After 30 steps: global_step is 31.000000, w is -0.983281, learning rate is 0.073230, loss is 0.000280
After 31 steps: global_step is 32.000000, w is -0.985730, learning rate is 0.072498, loss is 0.000204
After 32 steps: global_step is 33.000000, w is -0.987799, learning rate is 0.071773, loss is 0.000149
After 33 steps: global_step is 34.000000, w is -0.989550, learning rate is 0.071055, loss is 0.000109
After 34 steps: global_step is 35.000000, w is -0.991035, learning rate is 0.070345, loss is 0.000080
After 35 steps: global_step is 36.000000, w is -0.992297, learning rate is 0.069641, loss is 0.000059
After 36 steps: global_step is 37.000000, w is -0.993369, learning rate is 0.068945, loss is 0.000044
After 37 steps: global_step is 38.000000, w is -0.994284, learning rate is 0.068255, loss is 0.000033
After 38 steps: global_step is 39.000000, w is -0.995064, learning rate is 0.067573, loss is 0.000024
After 39 steps: global_step is 40.000000, w is -0.995731, learning rate is 0.066897, loss is 0.000018
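The first rows of this output can be reproduced in plain Python. The sketch below assumes an initial w of 5, a base rate of 0.1, a decay rate of 0.99, and a decay period of 1 step, which are the values consistent with the printed numbers. The variable names are hypothetical.

```python
base_rate, decay_rate, decay_steps = 0.1, 0.99, 1
w = 5.0

rows = []
for global_step in range(3):
    # Rate used for this update: evaluated before global_step is incremented.
    lr_used = base_rate * decay_rate ** (global_step // decay_steps)
    w = w - lr_used * 2.0 * (w + 1.0)
    # Rate that would be reported after the increment.
    lr_next = base_rate * decay_rate ** ((global_step + 1) // decay_steps)
    rows.append((global_step + 1, w, lr_next, (w + 1.0) ** 2))
    print("After %d steps: global_step is %d, w is %f, learning rate is %f, loss is %f"
          % (global_step, global_step + 1, w, lr_next, (w + 1.0) ** 2))
```

Under these assumptions, the learning rate printed on each row is the rate for the next update rather than the one just applied: the first update moves w from 5 to 3.8, which corresponds to a rate of 0.1, while the row reports 0.099.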