Gradient degeneration (the "degenerate" phenomenon) during training of a REINFORCE algorithm based on a shallow (fully connected) neural network

First, a link to the code:

https://gitee.com/devilmaycry812839668/CartPole-PolicyNetwork

A policy-network algorithm for reinforcement learning: the policy-network example from the reinforcement learning chapter of 《TensorFlow實戰》, using gym's CartPole as the simulated environment. This project partially refactors the original book code, adds some Chinese comments, and includes the results of 30 training runs.

=======================================

As shown above, the code implements a fairly simple REINFORCE algorithm whose policy function is a shallow three-layer fully connected network with ReLU activations. I ran 30 trials, each training for 10,000 episodes, and, surprisingly, two of the 30 trials (trial 5 and trial 21) exhibited severe gradient degeneration. A minimal sketch of this kind of policy network is given below for reference.
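For reference, here is a minimal sketch of such a REINFORCE policy network for CartPole, written in TF 1.x style to match the era of the book. The layer sizes, learning rate, and variable names are illustrative assumptions, not the exact repo code:

```python
import numpy as np
import tensorflow as tf  # TF 1.x style API

H = 50              # hidden units (assumed; the repo may differ)
learning_rate = 0.1
gamma = 0.99        # discount factor

# Three-layer fully connected policy network: input -> ReLU hidden -> sigmoid output
observations = tf.placeholder(tf.float32, [None, 4], name="input_x")
W1 = tf.get_variable("W1", shape=[4, H],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, W1))
W2 = tf.get_variable("W2", shape=[H, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
probability = tf.nn.sigmoid(tf.matmul(layer1, W2))  # P(action = 1)

# REINFORCE loss: -log pi(a|s) weighted by the discounted return
input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")        # action taken (0 or 1)
advantages = tf.placeholder(tf.float32, [None, 1], name="advantages")  # discounted returns
pi_a = input_y * probability + (1.0 - input_y) * (1.0 - probability)
loss = -tf.reduce_mean(tf.log(pi_a) * advantages)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

def discount_rewards(r):
    """Compute discounted, standardized returns for one episode."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        discounted[t] = running
    discounted -= discounted.mean()
    discounted /= (discounted.std() + 1e-8)
    return discounted
```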

Here is part of the training log from one of the degenerate runs:

Average reward for episode 1375 : 200.000000.
Average reward for episode 1400 : 200.000000.
Average reward for episode 1425 : 200.000000.
Average reward for episode 1450 : 200.000000.
Average reward for episode 1475 : 200.000000.
Average reward for episode 1500 : 200.000000.
Average reward for episode 1525 : 200.000000.
Average reward for episode 1550 : 192.480000.
Average reward for episode 1575 : 140.440000.
Average reward for episode 1600 : 104.240000.
Average reward for episode 1625 : 20.080000.
Average reward for episode 1650 : 12.560000.
Average reward for episode 1675 : 10.720000.
Average reward for episode 1700 : 11.080000.
Average reward for episode 1725 : 12.000000.
Average reward for episode 1750 : 10.560000.
Average reward for episode 1775 : 11.040000.
Average reward for episode 1800 : 10.360000.
Average reward for episode 1825 : 10.080000.
Average reward for episode 1850 : 10.640000.
Average reward for episode 1875 : 10.360000.
Average reward for episode 1900 : 10.360000.
Average reward for episode 1925 : 10.480000.
Average reward for episode 1950 : 10.360000.
Average reward for episode 1975 : 9.680000.
Average reward for episode 2000 : 10.000000.
Average reward for episode 2025 : 10.720000.
Average reward for episode 2050 : 10.000000.
Average reward for episode 2075 : 10.000000.
Average reward for episode 2100 : 10.520000.
Average reward for episode 2125 : 10.640000.
Average reward for episode 2150 : 9.760000.
Average reward for episode 2175 : 11.040000.

As the log shows, in trials 5 and 21, once training reached a certain number of episodes, the average reward collapsed to an extremely poor level, far below what a random policy achieves (a random policy scores around 26 on CartPole). In other words, training had degenerated. I had always assumed that degeneration only occurs in deep networks, so it was unexpected to observe it in a shallow one. The random-policy baseline is easy to verify, as the sketch below shows.
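As a sanity check on the ~26 figure, a quick estimate of the random-policy baseline can be run like this (a minimal sketch assuming the classic gym step API that returns four values):

```python
import gym

env = gym.make("CartPole-v0")
episodes = 1000
total = 0.0
for _ in range(episodes):
    env.reset()
    done, ep_reward = False, 0.0
    while not done:
        # sample a uniformly random action at every step
        _, reward, done, _ = env.step(env.action_space.sample())
        ep_reward += reward
    total += ep_reward
print("random-policy average reward: %.2f" % (total / episodes))
```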

Consulting the paper 《Skip connections eliminate singularities》, I found that shallow networks can indeed suffer from degeneration as well, which answered my question.
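For illustration only, the remedy that paper's title suggests could be applied to the shallow policy network sketched earlier by adding a skip connection from the input straight to the output layer. This is my own hedged adaptation reusing the names from the sketch above (observations, layer1, H), not something the repo or the book does:

```python
# Hypothetical variant of the output layer above: a skip connection
# concatenates the raw observation with the hidden activations, so the
# output layer can bypass any "dead" (degenerate) hidden units.
W2_skip = tf.get_variable("W2_skip", shape=[H + 4, 1],
                          initializer=tf.contrib.layers.xavier_initializer())
hidden_plus_input = tf.concat([layer1, observations], axis=1)
probability = tf.nn.sigmoid(tf.matmul(hidden_plus_input, W2_skip))
```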