Hung-yi Lee Machine Learning P12 HW2 Winner or Loser notes (a logistic regression model with MBGD optimization and z_score standardization, implemented without a framework)

阿新 • Published: 2018-11-11

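Building the logistic regression model:

Given a person's age, workclass, fnlwgt, education, education_num, marital_status, occupation and other attributes from the ADULT data set, predict whether their income is greater than 50K or not.

Data set:

The ADULT data set.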

train.csv：https://ntumlta.github.io/2017fall-ml-hw2/raw_data/train.csv

test.csv：https://ntumlta.github.io/2017fall-ml-hw2/raw_data/test.csv

True labels for the test set: https://ntumlta.github.io/2017fall-ml-hw2/correct_answer.csv

Data set home page: https://archive.ics.uci.edu/ml/datasets/Adult

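The labels in train are encoded as 0/1: label = 0 means the income is at most 50K, label = 1 means the income is greater than 50K.

For the test data, we compute the accuracy of the predictions.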

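First, use a grid search to find a reasonable set of initial values for w and b:

We use np to generate evenly spaced arrays that serve as 1000 candidate values of b and 1000 candidate groups of w. Each group of w consists of 14 consecutive values of the array, so the weights w0-w13 within a group are nearly equal; the search therefore covers only a thin slice of the 14-dimensional weight space.

We then create a 1000X1000 matrix to store the cross_entropy of each of these 1000X1000 combinations.

The batch size is 320. The data set has 32561 samples in total, so this is roughly 1% of the data. The samples are drawn at random, with a fixed random seed so that the same samples are drawn on every run.

The code is as follows: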

from math import exp, log, pow, sqrt
import numpy as np
import csv


# Sigmoid function applied to the linear output
def sigmoid(out):
	out = 1.0 / (1 + exp(-out))
	return out


# workclass: Self-emp-inc (self-employed, incorporated), Self-emp-not-inc (self-employed, not incorporated), Private,
# Federal-gov, State-gov, Local-gov, Never-worked, Without-pay; '?' marks a missing value
work_class = {
	'Self-emp-inc': 1.0, 'Self-emp-not-inc': 2.0, 'Private': 3.0, 'Federal-gov': 4.0, 'State-gov': 5.0,
	'Local-gov': 6.0,
	'Never-worked': 7.0, 'Without-pay': 8.0, '?': 0.0}
# education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters,
# 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education = {'Bachelors': 1.0, 'Some-college': 2.0, '11th': 3.0, 'HS-grad': 4.0, 'Prof-school': 5.0, 'Assoc-acdm': 6.0,
			 'Assoc-voc': 7.0, '9th': 8.0, '7th-8th': 9.0, '12th': 10.0,
			 'Masters': 11.0, '1st-4th': 12.0, '10th': 13.0, 'Doctorate': 14.0, '5th-6th': 15.0, 'Preschool': 16.0}
# marital_status: Married-civ-spouse (civilian spouse), Married-AF-spouse (Armed Forces spouse),
# Married-spouse-absent, Divorced, Separated, Widowed, Never-married
marital_status = {'Married-civ-spouse': 1.0, 'Married-AF-spouse': 2.0, 'Married-spouse-absent': 3.0, 'Divorced': 4.0,
				  'Separated': 5.0, 'Widowed': 6.0, 'Never-married': 7.0}
# occupation: Handlers-cleaners, Craft-repair, Other-service, Sales, Machine-op-inspct, Exec-managerial,
# Prof-specialty, Tech-support, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
# Protective-serv, Armed-Forces; '?' marks a missing value
occupation = {'Handlers-cleaners': 1.0, 'Craft-repair': 2.0, 'Other-service': 3.0, 'Sales': 4.0,
			  'Machine-op-inspct': 5.0,
			  'Exec-managerial': 6.0, 'Prof-specialty': 7.0, 'Tech-support': 8.0, 'Adm-clerical': 9.0,
			  'Farming-fishing': 10.0,
			  'Transport-moving': 11.0, 'Priv-house-serv': 12.0, 'Protective-serv': 13.0, 'Armed-Forces': 14.0,
			  '?': 0.0}
# relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
relationship = {'Wife': 1.0, 'Husband': 2.0, 'Own-child': 3.0, 'Unmarried': 4.0, 'Other-relative': 5.0,
				'Not-in-family': 6.0}
# race: White, Asian-Pac-Islander, Amer-Indian-Eskimo (American Indian or Eskimo), Other, Black
race = {'White': 1.0, 'Black': 2.0, 'Asian-Pac-Islander': 3.0, 'Amer-Indian-Eskimo': 4.0, 'Other': 5.0}
sex = {'Female': 1.0, 'Male': 2.0}
# native_country: the keys below are the country names exactly as they are spelled in the data set, including
# 'Columbia', 'Trinadad&Tobago', 'Hong' (Hong Kong) and 'Holand-Netherlands'; '?' marks a missing value

Nation_country = {'United-States': 1.0, 'Cambodia': 2.0, 'England': 3.0, 'Puerto-Rico': 4.0, 'Canada': 5.0,
				  'Germany': 6.0, 'Outlying-US(Guam-USVI-etc)': 7.0, 'India': 8.0, 'Japan': 9.0, 'Greece': 10.0,
				  'South': 11.0, 'China': 12.0, 'Cuba': 13.0, 'Iran': 14.0, 'Honduras': 15.0, 'Philippines': 16.0,
				  'Italy': 17.0, 'Poland': 18.0, 'Jamaica': 19.0, 'Vietnam': 20.0, 'Mexico': 21.0, 'Portugal': 22.0,
				  'Ireland': 23.0, 'France': 24.0, 'Dominican-Republic': 25.0, 'Laos': 26.0, 'Ecuador': 27.0,
				  'Taiwan': 28.0, 'Haiti': 29.0, 'Columbia': 30.0, 'Hungary': 31.0, 'Guatemala': 32.0,
				  'Nicaragua': 33.0, 'Scotland': 34.0, 'Thailand': 35.0, 'Yugoslavia': 36.0, 'El-Salvador': 37.0,
				  'Trinadad&Tobago': 38.0, 'Peru': 39.0, 'Hong': 40.0, 'Holand-Netherlands': 41.0, '?': 0.0}
income = {'>50K': 1.0, '<=50K': 0.0}

train_x_data_set = []
train_y_data_set = []

# Extract the x and y data sets used for training from the raw data
with open('train.csv', 'r', encoding='UTF-8', errors='ignore') as csv_file:
	all_lines = csv.reader(csv_file)
	# Iterate over every row of train.csv
	for one_line in all_lines:
		# Skip the header row
		if one_line[0] == 'age':
			continue
		one_line_x_data = []
		one_line_y_data = 0.0
		for i, element in enumerate(one_line):
			# Strip leading and trailing whitespace
			element = element.strip()
			if i == 1:
				one_line_x_data.append(work_class[element])
			elif i == 3:
				one_line_x_data.append(education[element])
			elif i == 5:
				one_line_x_data.append(marital_status[element])
			elif i == 6:
				one_line_x_data.append(occupation[element])
			elif i == 7:
				one_line_x_data.append(relationship[element])
			elif i == 8:
				one_line_x_data.append(race[element])
			elif i == 9:
				one_line_x_data.append(sex[element])
			elif i == 13:
				one_line_x_data.append(Nation_country[element])
			elif i == 14:
				one_line_y_data = income[element]
			else:
				one_line_x_data.append(int(element))
		train_x_data_set.append(one_line_x_data)
		train_y_data_set.append(one_line_y_data)


# Check: 32561 samples in total
# print(len(train_x_data_set))
# print(len(train_y_data_set))

# Check
# for i in range(len(train_x_data_set)):
# 	print(train_x_data_set[i])
# 	print(train_y_data_set[i])


# Apply z_score standardization to the data. All features are standardized, because several of them are so large
# in magnitude that the sigmoid computation would overflow if they were used directly
def data_z_score_standardization():
	global train_x_data_set
	# Standardize every feature (column) of train_x_data_set
	for index in range(len(train_x_data_set[0])):
		# Collect the values of this feature
		data_list = []
		for one_data in train_x_data_set:
			data_list.append(one_data[index])
		# Mean of this feature
		data_mean = 0.0
		for data in data_list:
			data_mean = data_mean + data
		data_mean = data_mean / len(data_list)
		# Variance of this feature
		data_variance = 0.0
		for data in data_list:
			data_variance = data_variance + pow((data - data_mean), 2)
		data_variance = data_variance / len(data_list)
		# Standard deviation of this feature
		data_standard_deviation = sqrt(data_variance)
		# Standardize this feature in train_x_data_set
		for subscript, one_data in enumerate(train_x_data_set):
			train_x_data_set[subscript][index] = (one_data[index] - data_mean) / data_standard_deviation
	return


# Besides the z_score standardization above, there is also a 0-1 normalization; try both if you are interested
# def zero_one_normalization():
# 	global train_x_data_set
# 	for index in range(len(train_x_data_set[0])):
# 		# Minimum and maximum of this feature, recomputed for every column
# 		x_min = min(one_data[index] for one_data in train_x_data_set)
# 		x_max = max(one_data[index] for one_data in train_x_data_set)
# 		for subscript, one_data in enumerate(train_x_data_set):
# 			train_x_data_set[subscript][index] = (one_data[index] - x_min) / (x_max - x_min)
# 	return


data_z_score_standardization()


# Check
# for i in range(len(train_x_data_set)):
# 	print(train_x_data_set[i])


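# Return the indices of bt_size samples, i.e. the samples drawn for training
def get_next_batch(bt_size, x_data_set):
	bt_index = np.arange(len(x_data_set))
	# Fix the random seed so the shuffle order is the same on every run, unless the seed value is changed
	np.random.seed(2401)
	# Shuffle the sample indices
	np.random.shuffle(bt_index)
	# Take the first bt_size indices
	bt_index = bt_index[0:bt_size]
	return bt_index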


batch_size = 320
# get_next_batch draws the samples at random; different samples would give different training results,
# so a fixed random seed is used to make the draw the same on every run
batch_index = get_next_batch(batch_size, train_x_data_set)

# y=b+w0*x0+w1*x1+w2*x2+w3*x3+w4*x4+w5*x5+w6*x6+w7*x7+w8*x8+w9*x9+w10*x10+w11*x11+w12*x12+w13*x13
# Run a grid search first to find a reasonable set of initial values for w and b
# X holds the 1000 candidate values of b
X = np.arange(-1.25, 1.25, step=0.0025)
# Y holds 14000 values, sliced below into 1000 groups of w with 14 values (w0-w13) each
Y = np.arange(-0.7, 0.7, step=0.0001)
# cross_entropy is a len(X) x len(X) matrix of zeros: one entry for each of the 1000 b values x 1000 w groups,
# holding the cross_entropy of that combination on the samples in batch_index
cross_entropy = np.zeros((len(X), len(X)))
b_min = 0.0
w_min = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cross_entropy_min = 10000
for i in range(len(X)):
	for j in range(len(X)):
		# Two nested loops: compute the mean cross_entropy of every (b, w) combination by hand
		b = X[i]
		w = Y[j * 14:j * 14 + 14]
		cross_entropy[i][j] = 0.0
		for key in batch_index:
			# Linear output for this sample, followed by the predicted probability
			x = train_x_data_set[key]
			y_pred = b + w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + w[3] * x[3] + w[4] * x[4] + w[5] * x[5] + \
					 w[6] * x[6] + w[7] * x[7] + w[8] * x[8] + w[9] * x[9] + w[10] * x[10] + w[11] * x[11] + \
					 w[12] * x[12] + w[13] * x[13]
			y_pred = sigmoid(y_pred)
			cross_entropy[i][j] = cross_entropy[i][j] - (
					train_y_data_set[key] * log(y_pred) + (1 - train_y_data_set[key]) * log(1 - y_pred))
		# Mean cross_entropy over the samples of the batch
		cross_entropy[i][j] = cross_entropy[i][j] / batch_size
		print(cross_entropy[i][j])
		print(w, b)
		# Keep track of the best combination seen so far
		if cross_entropy[i][j] < cross_entropy_min:
			cross_entropy_min = cross_entropy[i][j]
			b_min = b
			w_min[0:] = w[0:]
		print("Best parameters: cross_entropy_min={}\nw_min={},b_min={}".format(cross_entropy_min, w_min, b_min))
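
The innermost loop over batch_index can also be written with NumPy array operations. The sketch below is one possible alternative, not the code that produced the numbers in this post; `batch_cross_entropy` and the toy arrays are hypothetical names introduced only for illustration. It evaluates the mean cross-entropy of a single (w, b) candidate on a batch and uses `np.logaddexp`, so a large |z| does not overflow.

```python
import numpy as np


def batch_cross_entropy(x_batch, y_batch, w, b):
    """Mean binary cross-entropy of a logistic model on one batch.

    x_batch: (n, d) array of standardized features
    y_batch: (n,) array of 0/1 labels
    w: (d,) weights, b: scalar bias
    """
    z = x_batch @ w + b
    # -log(sigmoid(z))     = log(1 + exp(-z)) = logaddexp(0, -z)
    # -log(1 - sigmoid(z)) = log(1 + exp(z))  = logaddexp(0, z)
    loss = y_batch * np.logaddexp(0.0, -z) + (1.0 - y_batch) * np.logaddexp(0.0, z)
    return float(loss.mean())


# Toy data (hypothetical), only to show the call.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(320, 14))
y_toy = (rng.random(320) < 0.25).astype(float)
w_toy = np.full(14, 0.1)
b_toy = -1.2
print(batch_cross_entropy(x_toy, y_toy, w_toy, b_toy))
```

With the real data, x_batch would be the rows of train_x_data_set selected by batch_index, converted to an array. One call replaces the 320 iterations of the inner loop, which matters when the outer loops run 1000X1000 times.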

Running this code, we noticed well before it finished that cross_entropy_min had not changed for a long time, so we took the w and b values corresponding to that cross_entropy_min:

Best parameters: cross_entropy_min=0.5318484553883494
w_min=[0.11619999999991015, 0.11629999999991014, 0.11639999999991013, 0.11649999999991012, 0.11659999999991011, 0.1166999999999101, 0.11679999999991009, 0.11689999999991008, 0.11699999999991006, 0.11709999999991005, 0.11719999999991004, 0.11729999999991003, 0.11739999999991002, 0.11749999999991001],b_min=-1.207500000000001

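Based on these values we can narrow the range of the grid search and search again, by changing the corresponding lines of the code to the following: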

X = np.arange(-1.3, -1.1, step=0.0002)
Y = np.arange(0.04, 0.18, step=0.00001)

Run the code again. After a while, once cross_entropy_min stops changing, take the w and b values corresponding to it (there is no need to wait for the run to finish):

Best parameters: cross_entropy_min=0.5318789944169704
w_min=[0.11812000000002393, 0.11813000000002394, 0.11814000000002392, 0.11815000000002393, 0.11816000000002394, 0.11817000000002395, 0.11818000000002393, 0.11819000000002394, 0.11820000000002395, 0.11821000000002396, 0.11822000000002394, 0.11823000000002395, 0.11824000000002396, 0.11825000000002397],b_min=-1.2336000000000074

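Suppose we obtained the set of parameters above. We use these w and b values as the initial w and b. Note that this cross_entropy_min is marginally higher than the value from the coarser search (about 0.53188 versus 0.53185), so the narrower search, as stopped here, did not improve on it.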
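
The narrowing step can be wrapped in a small helper. This is a minimal sketch: `refine_range` is a hypothetical name, and the centres and half-widths are simply the ones that reproduce the narrowed X and Y ranges chosen above. It uses `np.linspace` with an explicit number of points, because the length of an `np.arange` result with a non-integer step can be off by one due to floating-point rounding.

```python
import numpy as np


def refine_range(center, half_width, num_points=1000):
    """Return num_points evenly spaced values in [center - half_width, center + half_width)."""
    return np.linspace(center - half_width, center + half_width, num_points, endpoint=False)


# 1000 candidate values of b in [-1.3, -1.1), step 0.0002
b_grid = refine_range(-1.2, 0.1)
# 14000 values in [0.04, 0.18), step 0.00001, to be sliced into 1000 groups of 14 weights
w_grid = refine_range(0.11, 0.07, num_points=14000)
print(len(b_grid), b_grid[0], b_grid[1] - b_grid[0])
print(len(w_grid), w_grid[0], w_grid[1] - w_grid[0])
```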

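A logistic regression model with MBGD optimization and z_score standardization:

First process the input data. Each sample has 15 columns: 14 features plus the income label. Continuous values are read in directly as floats; discrete (categorical) values are converted to float codes with dictionaries;

After processing we have train_x_data_set and train_y_data_set. Because some features in train_x_data_set are much larger in magnitude than others, we first standardize them with the function data_z_score_standardization;

Use get_next_batch to draw a given number of random samples for training. A seed is set here, so as long as the seed is unchanged the same samples are drawn every time, and the same batch is used in every training iteration;

Define the linear function linear_function, the cross-entropy loss function cross_entropy_function, and the optimization function gd_update_b_grad_and_w_grad_and_w_and_b (the MBGD optimization method is used here);

Define the initial w, b, lr and iteration. Note that w and b come from the grid search above;

Train the model for iteration steps, recording the cross_entropy, w and b at each step, and obtain the best pair b_train_best, w_train_best;

Plot the cross_entropy value over the course of training;

Use get_next_batch to draw a given number of random samples for testing. Note that the seed is different each time here;

Define an accuracy function to compute the accuracy;

Run test_iteration rounds of testing and compute the accuracy of each round.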
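
For reference, the gradient that the optimization function accumulates can be written out as follows. This is a short derivation under the notation of this post, with B denoting the set of sample indices in the batch and C the summed (not averaged) cross-entropy, which is what the update code works with.

```latex
\begin{aligned}
z^{(i)} &= b + \sum_{k=0}^{13} w_k x_k^{(i)}, \qquad
\hat{y}^{(i)} = \sigma\left(z^{(i)}\right) = \frac{1}{1 + e^{-z^{(i)}}} \\
C(w, b) &= -\sum_{i \in B} \left[ y^{(i)} \ln \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \ln\left(1 - \hat{y}^{(i)}\right) \right] \\
\frac{\partial C}{\partial w_k} &= \sum_{i \in B} \left[ -y^{(i)} \left(1 - \hat{y}^{(i)}\right) + \left(1 - y^{(i)}\right) \hat{y}^{(i)} \right] x_k^{(i)}
 = \sum_{i \in B} \left( \hat{y}^{(i)} - y^{(i)} \right) x_k^{(i)} \\
\frac{\partial C}{\partial b} &= \sum_{i \in B} \left( \hat{y}^{(i)} - y^{(i)} \right)
\end{aligned}
```

The middle form of the w_k derivative is the expression used term by term in gd_update_b_grad_and_w_grad_and_w_and_b; the right-hand form is the same quantity after expanding the brackets.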

The code is as follows:

from math import exp, log, pow, sqrt
import numpy as np
import matplotlib.pyplot as plt
import random
import csv

# workclass: Self-emp-inc (self-employed, incorporated), Self-emp-not-inc (self-employed, not incorporated), Private,
# Federal-gov, State-gov, Local-gov, Never-worked, Without-pay; '?' marks a missing value
work_class = {
	'Self-emp-inc': 1.0, 'Self-emp-not-inc': 2.0, 'Private': 3.0, 'Federal-gov': 4.0, 'State-gov': 5.0,
	'Local-gov': 6.0,
	'Never-worked': 7.0, 'Without-pay': 8.0, '?': 0.0}
# education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters,
# 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education = {'Bachelors': 1.0, 'Some-college': 2.0, '11th': 3.0, 'HS-grad': 4.0, 'Prof-school': 5.0, 'Assoc-acdm': 6.0,
			 'Assoc-voc': 7.0, '9th': 8.0, '7th-8th': 9.0, '12th': 10.0,
			 'Masters': 11.0, '1st-4th': 12.0, '10th': 13.0, 'Doctorate': 14.0, '5th-6th': 15.0, 'Preschool': 16.0}
# marital_status: Married-civ-spouse (civilian spouse), Married-AF-spouse (Armed Forces spouse),
# Married-spouse-absent, Divorced, Separated, Widowed, Never-married
marital_status = {'Married-civ-spouse': 1.0, 'Married-AF-spouse': 2.0, 'Married-spouse-absent': 3.0, 'Divorced': 4.0,
				  'Separated': 5.0, 'Widowed': 6.0, 'Never-married': 7.0}
# occupation: Handlers-cleaners, Craft-repair, Other-service, Sales, Machine-op-inspct, Exec-managerial,
# Prof-specialty, Tech-support, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
# Protective-serv, Armed-Forces; '?' marks a missing value
occupation = {'Handlers-cleaners': 1.0, 'Craft-repair': 2.0, 'Other-service': 3.0, 'Sales': 4.0,
			  'Machine-op-inspct': 5.0,
			  'Exec-managerial': 6.0, 'Prof-specialty': 7.0, 'Tech-support': 8.0, 'Adm-clerical': 9.0,
			  'Farming-fishing': 10.0,
			  'Transport-moving': 11.0, 'Priv-house-serv': 12.0, 'Protective-serv': 13.0, 'Armed-Forces': 14.0,
			  '?': 0.0}
# relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
relationship = {'Wife': 1.0, 'Husband': 2.0, 'Own-child': 3.0, 'Unmarried': 4.0, 'Other-relative': 5.0,
				'Not-in-family': 6.0}
# race: White, Asian-Pac-Islander, Amer-Indian-Eskimo (American Indian or Eskimo), Other, Black
race = {'White': 1.0, 'Black': 2.0, 'Asian-Pac-Islander': 3.0, 'Amer-Indian-Eskimo': 4.0, 'Other': 5.0}
sex = {'Female': 1.0, 'Male': 2.0}
# native_country: the keys below are the country names exactly as they are spelled in the data set, including
# 'Columbia', 'Trinadad&Tobago', 'Hong' (Hong Kong) and 'Holand-Netherlands'; '?' marks a missing value

Nation_country = {'United-States': 1.0, 'Cambodia': 2.0, 'England': 3.0, 'Puerto-Rico': 4.0, 'Canada': 5.0,
				  'Germany': 6.0, 'Outlying-US(Guam-USVI-etc)': 7.0, 'India': 8.0, 'Japan': 9.0, 'Greece': 10.0,
				  'South': 11.0, 'China': 12.0, 'Cuba': 13.0, 'Iran': 14.0, 'Honduras': 15.0, 'Philippines': 16.0,
				  'Italy': 17.0, 'Poland': 18.0, 'Jamaica': 19.0, 'Vietnam': 20.0, 'Mexico': 21.0, 'Portugal': 22.0,
				  'Ireland': 23.0, 'France': 24.0, 'Dominican-Republic': 25.0, 'Laos': 26.0, 'Ecuador': 27.0,
				  'Taiwan': 28.0, 'Haiti': 29.0, 'Columbia': 30.0, 'Hungary': 31.0, 'Guatemala': 32.0,
				  'Nicaragua': 33.0, 'Scotland': 34.0, 'Thailand': 35.0, 'Yugoslavia': 36.0, 'El-Salvador': 37.0,
				  'Trinadad&Tobago': 38.0, 'Peru': 39.0, 'Hong': 40.0, 'Holand-Netherlands': 41.0, '?': 0.0}
income = {'>50K': 1.0, '<=50K': 0.0}

train_x_data_set = []
train_y_data_set = []

# Extract the x and y data sets used for training from the raw data
with open('train.csv', 'r', encoding='UTF-8', errors='ignore') as csv_file:
	all_lines = csv.reader(csv_file)
	# Iterate over every row of train.csv
	for one_line in all_lines:
		# Skip the header row
		if one_line[0] == 'age':
			continue
		one_line_x_data = []
		one_line_y_data = 0.0
		for i, element in enumerate(one_line):
			# Strip leading and trailing whitespace
			element = element.strip()
			if i == 1:
				one_line_x_data.append(work_class[element])
			elif i == 3:
				one_line_x_data.append(education[element])
			elif i == 5:
				one_line_x_data.append(marital_status[element])
			elif i == 6:
				one_line_x_data.append(occupation[element])
			elif i == 7:
				one_line_x_data.append(relationship[element])
			elif i == 8:
				one_line_x_data.append(race[element])
			elif i == 9:
				one_line_x_data.append(sex[element])
			elif i == 13:
				one_line_x_data.append(Nation_country[element])
			elif i == 14:
				one_line_y_data = income[element]
			else:
				one_line_x_data.append(float(element))
		train_x_data_set.append(one_line_x_data)
		train_y_data_set.append(one_line_y_data)


# Check: 32561 samples in total
# print(len(train_x_data_set))
# print(len(train_y_data_set))

# Check
# for i in range(len(train_x_data_set)):
# 	print(train_x_data_set[i])
# 	print(train_y_data_set[i])


# Apply z_score standardization to the data. All features are standardized, because several of them are so large
# in magnitude that the sigmoid computation would overflow if they were used directly
def data_z_score_standardization(x_data_set):
	# Standardize every feature (column) of x_data_set
	for index in range(len(x_data_set[0])):
		# Collect the values of this feature
		data_list = []
		for one_data in x_data_set:
			data_list.append(one_data[index])
		# Mean of this feature
		data_mean = 0.0
		for data in data_list:
			data_mean = data_mean + data
		data_mean = data_mean / len(data_list)
		# Variance of this feature
		data_variance = 0.0
		for data in data_list:
			data_variance = data_variance + pow((data - data_mean), 2)
		data_variance = data_variance / len(data_list)
		# Standard deviation of this feature
		data_standard_deviation = sqrt(data_variance)
		# Standardize this feature in x_data_set
		for subscript, one_data in enumerate(x_data_set):
			x_data_set[subscript][index] = (one_data[index] - data_mean) / data_standard_deviation
	return x_data_set


# Sigmoid function applied to the linear output
def sigmoid(out):
	out = 1.0 / (1 + exp(-out))
	return out


# Besides the z_score standardization above, there is also a 0-1 normalization; try both if you are interested
# def zero_one_normalization(x_data_set):
# 	for index in range(len(x_data_set[0])):
# 		# Minimum and maximum of this feature, recomputed for every column
# 		x_min = min(one_data[index] for one_data in x_data_set)
# 		x_max = max(one_data[index] for one_data in x_data_set)
# 		for subscript, one_data in enumerate(x_data_set):
# 			x_data_set[subscript][index] = (one_data[index] - x_min) / (x_max - x_min)
# 	return x_data_set


train_x_data_set = data_z_score_standardization(train_x_data_set)


# Check
# for i in range(len(train_x_data_set)):
# 	print(train_x_data_set[i])


# Return the indices of bt_size samples, i.e. the samples drawn for training or testing
def get_next_batch(bt_size, x_data_set, sed):
	bt_index = np.arange(len(x_data_set))
	# Set the random seed: the shuffle order is the same for the same seed value
	np.random.seed(sed)
	# Shuffle the sample indices
	np.random.shuffle(bt_index)
	# Take the first bt_size indices
	bt_index = bt_index[0:bt_size]
	return bt_index


train_batch_size = 320
# Draw the training samples
# get_next_batch draws them at random; the fixed seed 2401 makes the draw the same on every run
batch_train_index = get_next_batch(train_batch_size, train_x_data_set, 2401)


# Linear function: y = b + w0*x0 + ... + w13*x13 for the sample with index key
def linear_function(bdata, wdata, key, x_data_set):
	y = bdata + wdata[0] * x_data_set[key][0] + wdata[1] * x_data_set[key][1] + wdata[2] * \
		x_data_set[key][2] + wdata[3] * x_data_set[key][3] + wdata[4] * \
		x_data_set[key][4] + wdata[5] * x_data_set[key][5] + wdata[6] * \
		x_data_set[key][6] + wdata[7] * x_data_set[key][7] + wdata[8] * \
		x_data_set[key][8] + wdata[9] * x_data_set[key][9] + wdata[10] * \
		x_data_set[key][10] + wdata[11] * x_data_set[key][11] + wdata[12] * \
		x_data_set[key][12] + wdata[13] * x_data_set[key][13]
	return y


# Cross-entropy function: bt_index holds the indices of the input samples, wdata is w0-w13 and bdata is b
def cross_entropy_function(bt_index, x_data_set, y_data_set, wdata, bdata):
	crs_entropy = 0.0
	for index in bt_index:
		# Add the cross-entropy of this sample
		y_p = linear_function(bdata, wdata, index, x_data_set)
		y_p = sigmoid(y_p)
		crs_entropy = crs_entropy - (
				y_data_set[index] * log(y_p) + (1 - y_data_set[index]) * log(1 - y_p))
	# Mean cross-entropy over the batch
	return crs_entropy / len(bt_index)


# Initial value of b
b = -1.2336000000000074
# Initial values of w0-w13
w = [0.11812000000002393, 0.11813000000002394, 0.11814000000002392, 0.11815000000002393, 0.11816000000002394,
	 0.11817000000002395, 0.11818000000002393, 0.11819000000002394, 0.11820000000002395, 0.11821000000002396,
	 0.11822000000002394, 0.11823000000002395, 0.11824000000002396, 0.11825000000002397]
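# Note: the initial values above are a set with a small cross_entropy found in the grid search
# Initial learning rates for w and b; they can be chosen with reference to the step sizes of the grid search, then set smaller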
lrw, lrb = 0.00000001, 0.0000001
# Number of training iterations
train_iteration = 2000
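# b_history, w_history and cross_entropy_history store the b, w and cross_entropy values seen during training
b_history = [b]
# Store a copy of w, because the list w is modified in place during training
w_history = [list(w)]
# The first entry of cross_entropy_history is computed from the initial w and b with cross_entropy_function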
cross_entropy_history = [cross_entropy_function(batch_train_index, train_x_data_set, train_y_data_set, w, b)]

# Partial derivatives with respect to b and w, initialized to 0
b_grad = 0.0
w_grad = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


# Accumulate the gradient of the summed cross-entropy over one batch, then update w and b.
# bgrad is the partial derivative with respect to b, wgrad holds the partial derivatives with respect to w0-w13,
# bt_index holds the indices of the batch samples, wdata is w0-w13 and bdata is b.
# Note: wgrad and wdata are lists and are modified in place. The caller passes the same w_grad list on every call
# and never resets it, so the w gradient keeps accumulating across iterations; bgrad is a float and starts again
# from the value passed in on each call.
def gd_update_b_grad_and_w_grad_and_w_and_b(bgrad, wgrad, bt_index, x_data_set, y_data_set, wdata, bdata):
	for index in bt_index:
		# Predicted probability for this sample, computed once and reused below
		y_p = sigmoid(linear_function(bdata, wdata, index, x_data_set))
		bgrad = bgrad - y_data_set[index] * (1 - y_p) * 1.0 + (1 - y_data_set[index]) * y_p * 1.0
		for subscript in range(len(wgrad)):
			wgrad[subscript] = wgrad[subscript] - y_data_set[index] * (1 - y_p) * x_data_set[index][subscript] + (
					1 - y_data_set[index]) * y_p * x_data_set[index][subscript]
	# Update w and b: subtract learning rate times partial derivative
	bdata = bdata - lrb * bgrad
	for subscript in range(len(wgrad)):
		wdata[subscript] = wdata[subscript] - lrw * wgrad[subscript]
	return bdata, wdata


# Check
# w_out, b_out = gd_update_b_grad_and_w_grad_and_w_and_b(b_grad, w_grad, batch_index, train_x_data_set, train_y_data_set,
# 													   w, b)
# print(w_out, b_out)

# Keep the best w and b found during training, together with the training error at that point
b_train_best = 0.0
w_train_best = []
training_error_best = 10000.0

# Train the model
for i in range(train_iteration):
	b_temp, w_temp = gd_update_b_grad_and_w_grad_and_w_and_b(b_grad, w_grad, batch_train_index, train_x_data_set,
		train_y_data_set, w, b)
	# w is updated in place inside the function (w_temp is the same list object as w).
	# b is not reassigned here, so in every iteration b_temp is a single step away from the initial b.
	cross_entropy_history.append(
		cross_entropy_function(batch_train_index, train_x_data_set, train_y_data_set, w_temp, b_temp))
	if cross_entropy_history[-1] < training_error_best:
		training_error_best = cross_entropy_history[-1]
		b_train_best = b_temp
		# Copy the list, because w_temp keeps changing in later iterations
		w_train_best = list(w_temp)
	b_history.append(b_temp)
	w_history.append(list(w_temp))
	print("iteration:{} cross_entropy={} w={} b={}".format(i, cross_entropy_history[-1], w_temp, b_temp))
# Best cross_entropy, b and w obtained in training
print(training_error_best, b_train_best, w_train_best)

# Plot the cross_entropy history
x_axis = list(range(train_iteration + 1))
# 'o-': circle markers joined by a solid line
plt.plot(x_axis, cross_entropy_history, 'o-', markersize=3, linewidth=1.5, color='black')
# Axis labels and their font size (plain strings, so the underscores are not parsed as mathtext subscripts)
plt.xlabel('train_iteration', fontsize=16)
plt.ylabel('train_cross_entropy', fontsize=16)
# Show the figure
plt.show()

# Process the test file test.csv to get the x data of the test set
test_x_data_set = []
with open('test.csv', 'r', encoding='UTF-8', errors='ignore') as csv_file:
	all_lines = csv.reader(csv_file)
	# Iterate over every row of test.csv
	for one_line in all_lines:
		# Skip the header row
		if one_line[0] == 'age':
			continue
		one_line_x_data = []
		for i, element in enumerate(one_line):
			# Strip leading and trailing whitespace
			element = element.strip()
			if i == 1:
				one_line_x_data.append(work_class[element])
			elif i == 3:
				one_line_x_data.append(education[element])
			elif i == 5:
				one_line_x_data.append(marital_status[element])
			elif i == 6:
				one_line_x_data.append(occupation[element])
			elif i == 7:
				one_line_x_data.append(relationship[element])
			elif i == 8:
				one_line_x_data.append(race[element])
			elif i == 9:
				one_line_x_data.append(sex[element])
			elif i == 13:
				one_line_x_data.append(Nation_country[element])
			else:
				one_line_x_data.append(float(element))
		test_x_data_set.append(one_line_x_data)

# Process correct_answer.csv to get the true labels (y data) of the test set
test_y_data_set = []
with open('correct_answer.csv', 'r', encoding='UTF-8', errors='ignore') as csv_file:
	all_lines = csv.reader(csv_file)
	# Iterate over every row of correct_answer.csv
	for one_line in all_lines:
		# Skip the header row
		if one_line[0] == 'id':
			continue
		# Column 0 is the sample id, column 1 is the label
		test_y_data_set.append(float(one_line[1].strip()))

# The x data used for testing is standardized as well (here with the test set's own mean and standard deviation)
test_x_data_set = data_z_score_standardization(test_x_data_set)
test_iteration = 3
test_batch_size = 320


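# Compute the prediction accuracy of the model on the samples in bt_index
def accuracy(bt_index, x_data_set, y_data_set, wdata, bdata):
	y_l_list = []
	res_list = []
	for index in bt_index:
		# Predicted probability for this sample
		y_p = linear_function(bdata, wdata, index, x_data_set)
		y_p = sigmoid(y_p)
		# Predicted label: 1.0 if the probability is at least 0.5, otherwise 0.0
		y_l = 0.0
		res = 0.0
		if y_p >= 0.5:
			y_l = 1.0
		# res is 1.0 when the prediction matches the true label
		if y_l == y_data_set[index]:
			res = 1.0
		res_list.append(res)
		y_l_list.append(y_l)
	# Accuracy is the fraction of correct predictions
	acr = 0.0
	for res in res_list:
		acr = acr + res
	acr = acr / len(bt_index)
	return y_l_list, acr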


# Check
# seed_value = random.randint(0, 10000)
# batch_test_index = get_next_batch(test_batch_size, test_x_data_set, seed_value)
# y_pred_label_list, acc = accuracy(batch_test_index, test_x_data_set, test_y_data_set, w_train_best, b_train_best)
# print(y_pred_label_list)
# print(acc)

for i in range(test_iteration):
	# Draw a batch of test samples with a new random seed
	seed_value = random.randint(0, 10000)
	batch_test_index = get_next_batch(test_batch_size, test_x_data_set, seed_value)
	# Accuracy of the predictions on this batch
	y_pred_label_list, acc = accuracy(batch_test_index, test_x_data_set, test_y_data_set, w_train_best, b_train_best)
	# cross_entropy on this batch
	test_cross_entropy = cross_entropy_function(batch_test_index, test_x_data_set, test_y_data_set, w_train_best,
		b_train_best)
	for num, j in enumerate(batch_test_index):
		print("sample index: {} predicted label: {} true label: {}".format(j, y_pred_label_list[num], test_y_data_set[j]))
	print("test_iteration:{}\n acc:{}\n test_cross_entropy:{}".format(i, acc, test_cross_entropy))
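
In the training loop above, the same 320-sample batch is used in every iteration and the w gradient list is carried over from one call to the next. A more common form of mini-batch gradient descent draws a fresh batch at each step and recomputes the gradient from zero. The sketch below illustrates that variant on synthetic data; `train_mbgd`, the learning rate, the number of epochs and the demo arrays are all assumptions made for illustration, and it does not reproduce the numbers reported in this post.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable sigmoid: exp is only ever called on non-positive values.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def train_mbgd(x, y, lr=0.1, batch_size=320, epochs=20, seed=2401):
    """Mini-batch gradient descent for logistic regression (mean cross-entropy)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        # Reshuffle once per epoch, then walk through the data batch by batch.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = sigmoid(x[idx] @ w + b) - y[idx]  # y_hat - y for each sample
            # The gradient is recomputed from zero for every batch and averaged.
            w -= lr * (x[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b


# Synthetic data (hypothetical), already on a standardized scale.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(4000, 14))
true_w = rng.normal(size=14)
y_demo = (rng.random(4000) < sigmoid(x_demo @ true_w - 1.0)).astype(float)

w_fit, b_fit = train_mbgd(x_demo, y_demo)
acc = np.mean((sigmoid(x_demo @ w_fit + b_fit) >= 0.5) == (y_demo == 1.0))
print(acc)
```

Averaging the gradient over the batch keeps the step size independent of the batch size, which is why the learning rate here is far larger than lrw and lrb above.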

The results are as follows:

Figure: the cross_entropy value during training.

The best cross_entropy, b and w obtained:

0.3851662645508392 -1.233601624702985 [0.39661073902905447, 0.057156146772790635, 0.06768751743168605, 0.09053575022632332, 0.6186871098635961, -0.43385310606433936, 0.15738231823194487, -0.26448422319592063, -0.0567938200499726, 0.3312379050966214, 0.3390591816951852, 0.13720501938348878, 0.41144536493112943, -0.11103552637912766]

The test results are as follows (the test code comes right after the training code):

test_iteration:0
 acc:0.796875
 test_cross_entropy:0.4040875069037342

test_iteration:1
 acc:0.821875
 test_cross_entropy:0.3909683414258476

test_iteration:2
 acc:0.81875
 test_cross_entropy:0.3850121737082529

Process finished with exit code 0

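The accuracy is around 0.8. This model was written only to implement the principles of logistic regression by hand, and neither the model nor the minimum that was found is particularly good, so the accuracy is not very high. Interested readers can change the model (for example to a higher-order model) and the range of the grid search to look for a better model and a better minimum. The main purpose of this post is to implement a logistic regression model in code.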
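
One more thing that may be worth trying: in the code above the test features are standardized with the test set's own mean and standard deviation. A common alternative is to compute the statistics on the training set once and reuse them for the test set, so that both pass through exactly the same transformation. The sketch below shows the idea with NumPy; `fit_standardizer` and `apply_standardizer` are hypothetical helper names, and the small arrays are made up for the example.

```python
import numpy as np


def fit_standardizer(x_train):
    """Per-feature mean and standard deviation of the training data."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)  # population std (ddof=0), as in data_z_score_standardization
    std[std == 0.0] = 1.0      # guard against constant columns
    return mean, std


def apply_standardizer(x, mean, std):
    """Standardize x with statistics that were computed elsewhere."""
    return (x - mean) / std


# Made-up numbers, only to show the calls.
x_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test_demo = np.array([[2.0, 40.0]])
mean, std = fit_standardizer(x_train_demo)
print(apply_standardizer(x_train_demo, mean, std))
print(apply_standardizer(x_test_demo, mean, std))
```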
