Reinforcement Learning Mini Case Study (19)

阿新 • Published: 2019-02-06

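A small reinforcement learning example:

Suppose we have a 4*4 grid and want to get from some cell in it to either the top-left cell or the bottom-right cell.

In other words, starting from any cell in the middle of the grid, the goal is to reach the top-left cell (0,0) or the bottom-right cell (3,3).

The first step is to create the environment.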

'''
Build a grid world
'''
import numpy as np
# Note: this needs a gym release that still ships gym.envs.toy_text.discrete
from gym.envs.toy_text import discrete

# Action codes for the four directions
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3


class GridWorldEnv(discrete.DiscreteEnv):
    def __init__(self, shape=[4, 4]):
        self.shape = shape
        self.nS = np.prod(self.shape)   # number of states (cells)
        self.nA = 4                     # number of actions
        Max_y = shape[0]                # number of rows
        Max_x = shape[1]                # number of columns
        p = {}
        grid = np.arange(self.nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex            # flat index of the current cell
            y, x = it.multi_index       # (row, column) of the current cell
            # p[s][a] is a list of (probability, next_state, reward, done)
            p[s] = {a: [] for a in range(self.nA)}
            is_done = lambda s: s == 0 or s == (self.nS - 1)
            reward = 0.0 if is_done(s) else -1.0
            if is_done(s):
                # Terminal cells are absorbing: every action stays in place
                p[s][UP] = [(1.0, s, reward, True)]
                p[s][RIGHT] = [(1.0, s, reward, True)]
                p[s][DOWN] = [(1.0, s, reward, True)]
                p[s][LEFT] = [(1.0, s, reward, True)]
            else:
                # Moving into a wall leaves the agent where it is
                np_UP = s if y == 0 else s - Max_x
                np_RIGHT = s if x == (Max_x - 1) else s + 1
                np_DOWN = s if y == (Max_y - 1) else s + Max_x
                np_LEFT = s if x == 0 else s - 1
                p[s][UP] = [(1.0, np_UP, reward, is_done(np_UP))]
                p[s][RIGHT] = [(1.0, np_RIGHT, reward, is_done(np_RIGHT))]
                p[s][DOWN] = [(1.0, np_DOWN, reward, is_done(np_DOWN))]
                p[s][LEFT] = [(1.0, np_LEFT, reward, is_done(np_LEFT))]
            it.iternext()
        self.p = p

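First we define the codes for moving up, right, down and left. Then we define a class that initializes the environment.

Here is the code explained piece by piece: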

        self.shape = shape
        self.nS = np.prod(self.shape)
        self.nA = 4
        Max_y = shape[0]
        Max_x = shape[1]

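np.prod multiplies the elements together: passing in [4,5,6] gives 120. Here we pass in [4,4], so we get 16, which is the number of cells. Max_y and Max_x are the number of rows and the number of columns.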
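A quick check of the numbers above, as a minimal sketch. The names `n_states`, `max_y` and `max_x` are only illustrative and are not taken from the class.

```python
import numpy as np

shape = [4, 4]
n_states = int(np.prod(shape))    # 4 * 4 cells in the grid
max_y, max_x = shape              # number of rows, number of columns

print(n_states)                   # 16
print(int(np.prod([4, 5, 6])))    # 120
```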

        p = {}
        grid = np.arange(self.nS).reshape(shape)
        it = np.nditer(grid , flags = ['multi_index'])

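grid is a 4x4 table whose cells are numbered 0 to 15. np.nditer creates an iterator over that table, and the multi_index flag makes it keep track of the (row, column) position as well. Walking through it visits the cells in the order 0,1,2,3,4,5,6,7 and so on.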
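To see what the iterator hands back, here is a small standalone sketch that only collects the indices. The `visited` list is an illustrative name, not part of the environment.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)
it = np.nditer(grid, flags=['multi_index'])

visited = []
while not it.finished:
    s = it.iterindex           # flat index, counts 0..15
    y, x = it.multi_index      # (row, column) of the same cell
    visited.append((s, y, x))
    it.iternext()              # without this the loop never advances

print(visited[:5])
# [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 0, 3), (4, 1, 0)]
```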

        while not it.finished:
            s = it.iterindex
            y , x = it.multi_index
            p[s] = {a : [] for a in range(self.nA)}
            is_done = lambda s :s == 0 or s == (self.nS - 1)
            reward = 0.0 if is_done(s) else -1.0

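s is the flat index of the cell: (0,0) gives 0, (0,1) gives 1, and so on, with the count simply continuing onto the next row. y , x are the coordinates, so for cell (1,3) we get y = 1, x = 3. p[s] records, for each cell, which moves are possible and what happens after each one. Every cell has four moves (up, right, down, left), and each move has a next state and a reward, so each cell ends up with four entries, one per action. is_done says whether a state is a terminal cell. reward is the reward: 0 at the two terminal cells and -1 everywhere else.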

            if is_done(s):
                p[s][UP] = [(1.0 , s , reward , True)]
                p[s][RIGHT] = [(1.0 , s , reward , True)]
                p[s][DOWN] = [(1.0 , s , reward , True)]
                p[s][LEFT] = [(1.0 , s , reward , True)]
            else:
                np_UP = s if y == 0 else s - Max_x
                np_RIGHT = s if x == (Max_x - 1) else s + 1
                np_DOWN = s if y == (Max_y - 1) else s + Max_x
                np_LEFT = s if x == 0 else s - 1
                p[s][UP] = [(1.0 , np_UP , reward , is_done(np_UP))]
                p[s][RIGHT] = [(1.0 , np_RIGHT , reward , is_done(np_RIGHT))]
                p[s][DOWN] = [(1.0 , np_DOWN , reward , is_done(np_DOWN))]
                p[s][LEFT] = [(1.0 , np_LEFT , reward , is_done(np_LEFT))]

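The last part is just the assignment. The first item of each tuple is a probability. The environment is deterministic, so every action leads to exactly one next state and all entries get the same value, 1.0; the choice of action is made later by picking the highest expected value. The second item is the next state, which here simply means the index of the cell the agent ends up in. If is_done is True for the current cell, we are already at a terminal, so the "next state" is the cell itself. reward is the reward explained a few lines above. The last item says whether the move lands on a terminal cell: True if it does, False otherwise.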
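The transition table can also be inspected without installing gym. The sketch below rebuilds the same dictionary with plain Python; `build_transitions` is a hypothetical helper written for this illustration and is not part of the original class.

```python
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

def build_transitions(max_y=4, max_x=4):
    """Hypothetical gym-free helper: returns p[s][a] = [(prob, next_state, reward, done)]."""
    n_states = max_y * max_x
    is_done = lambda s: s == 0 or s == n_states - 1
    p = {}
    for s in range(n_states):
        y, x = divmod(s, max_x)                  # (row, column) of flat index s
        reward = 0.0 if is_done(s) else -1.0
        if is_done(s):
            # Terminal cells are absorbing: every action stays in place
            targets = {a: s for a in (UP, RIGHT, DOWN, LEFT)}
        else:
            # Moving into a wall leaves the agent where it is
            targets = {
                UP: s if y == 0 else s - max_x,
                RIGHT: s if x == max_x - 1 else s + 1,
                DOWN: s if y == max_y - 1 else s + max_x,
                LEFT: s if x == 0 else s - 1,
            }
        p[s] = {a: [(1.0, ns, reward, is_done(ns))] for a, ns in targets.items()}
    return p

p = build_transitions()
print(p[1][LEFT])   # [(1.0, 0, -1.0, True)]   stepping left from cell 1 reaches the goal
print(p[1][UP])     # [(1.0, 1, -1.0, False)]  bumping into the top wall stays put
print(p[0][DOWN])   # [(1.0, 0, 0.0, True)]    a terminal cell never moves
```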

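Finally, don't forget to advance the iterator to the next cell, otherwise it stays at 0 forever.

After that comes the iteration that optimizes the value table.

It has to be updated over and over, sweep after sweep.

The full code: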

'''
Movement test
'''
import numpy as np
from learning.GridWorld import GridWorldEnv

env = GridWorldEnv()


def value_iteration(env, theta=0.001, discount=0.7):
    def one_step(state, v):
        # Expected value of each action from `state`, given current values v
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.p[state][a]:
                A[a] += prob * (reward + discount * v[next_state])
        return A

    v = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            A = one_step(s, v)
            best_action = np.max(A)   # value of the best action, not its index
            delta = max(delta, np.abs(best_action - v[s]))
            v[s] = best_action
        # Stop once no state value changed by theta or more in a full sweep
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step(s, v)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0
    return policy, v


policy, v = value_iteration(env)
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print(np.reshape(v, [4, 4]))

First, of course, we get the environment.

def value_iteration(env , theta = 0.001 , discount = 0.7):
    def one_step(state , v):
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.p[state][a]:
                A[a] += prob * (reward + discount * v[next_state])
        return A

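Next comes the iterative update. env is the environment, theta is the stopping threshold for the iteration, and discount is the discount factor, i.e. how much the expected future value influences the current result. The one_step function computes the expected value of each of the four directions for one cell; the direction with the largest value is the one that gets chosen. v[next_state] is the value of the next state, and it keeps being updated as the iteration goes on.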
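Written out as a formula, one_step and the update that follows it correspond to the usual value iteration backup. The symbols P, R, V and γ below are notation introduced here for illustration: P and R stand for the probability and reward stored in env.p, V for the array v, and γ for discount.

```latex
A(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V(s') \bigr],
\qquad
V(s) \leftarrow \max_{a} A(s, a)
```

In this grid every action has a single outcome with probability 1.0, so the sum has only one term. For example, from cell 1 the move LEFT lands on the terminal cell 0, which gives A = -1 + 0.7 * 0 = -1.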

    v = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            A = one_step(s , v)
            best_action = np.max(A)
            delta = max(delta , np.abs(best_action - v[s]))
            v[s] = best_action
        if delta < theta:
            break

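Now the loop starts.

delta is the largest change in any state value during one sweep. The loop stops once that change is smaller than theta.

The sweeps keep repeating. Despite its name, best_action here holds the value of the best direction (np.max), not the direction itself; the direction is picked later with np.argmax.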

    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step(s , v)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0
    return policy, v

policy , v = value_iteration(env)
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print(np.reshape(v , [4,4]))

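Finally we get the output.

The first printed array is the direction to take from each position. For example, at cell (0,1) the value is 3, which means move left (LEFT = 3), straight onto the goal at (0,0).

The second printed array is v, the discounted expected future reward of each cell: the final values after all the repeated updates.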
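The action codes are easier to read as arrows. The sketch below is a self-contained, gym-free rerun of the same procedure under the same assumptions (4x4 grid, reward -1 per move, discount 0.7, theta 0.001); `step`, `lookahead`, `solve` and `ARROWS` are hypothetical names used only for this illustration.

```python
import numpy as np

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
ARROWS = {UP: '^', RIGHT: '>', DOWN: 'v', LEFT: '<'}
N = 4                          # grid side length
TERMINALS = {0, N * N - 1}     # top-left and bottom-right cells

def step(s, a):
    """Deterministic move from flat index s; returns (next_state, reward)."""
    if s in TERMINALS:
        return s, 0.0
    y, x = divmod(s, N)
    if a == UP and y > 0:
        s -= N
    elif a == RIGHT and x < N - 1:
        s += 1
    elif a == DOWN and y < N - 1:
        s += N
    elif a == LEFT and x > 0:
        s -= 1
    return s, -1.0             # a move into a wall keeps s but still costs -1

def lookahead(s, v, discount):
    """Value of each of the four actions from s under the current v."""
    return np.array([r + discount * v[ns]
                     for ns, r in (step(s, a) for a in range(4))])

def solve(theta=0.001, discount=0.7):
    v = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            best = lookahead(s, v, discount).max()
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    actions = [int(np.argmax(lookahead(s, v, discount))) for s in range(N * N)]
    return actions, v

actions, v = solve()
for row in range(N):
    print(' '.join('*' if row * N + col in TERMINALS
                   else ARROWS[actions[row * N + col]]
                   for col in range(N)))
print(np.round(v.reshape(N, N), 2))
```

Terminal cells are printed as `*`. Where two directions are equally good, np.argmax keeps the one with the lower action code, so the arrows show one optimal policy out of several.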

