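Machine Learning in Action, Chapter 2, Section 2.2: Improving Matches on a Dating Site with the k-Nearest Neighbors Algorithm

阿新 • Published: 2019-01-20

This series of posts on Machine Learning in Action is mainly about implementing and understanding the code in the book, so it works as a set of reading notes. A hands-on book cannot be learned by reading alone; once you start typing the code you run into all sorts of odd problems. The posts are rough and should be read alongside the book. I am looking things up and learning as I go, so my knowledge is limited; please point out any mistakes in the comments. The book's code and data are widely available online; please download them yourself.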

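Applying the kNN algorithm

2.2.1 Parsing data from a text file

The function takes a filename string as input and returns the training sample matrix and the class label vector.

Parsing code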

from numpy import *   # the code in kNN.py uses zeros, array, tile and shape unqualified

def file2matrix(filename):
    with open(filename) as fr:                  # read the file once and close it afterwards
        lines = fr.readlines()
    numberOfLines = len(lines)                  # number of lines in the file
    returnMat = zeros((numberOfLines,3))        # matrix to return
    classLabelVector = []                       # labels to return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]  # store the first three columns in the matrix
        classLabelVector.append(int(listFromLine[-1]))  # last column, converted to int, goes in the label vector
        index += 1
    return returnMat,classLabelVector

The data loads successfully; check what was read:

>>> import kNN
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> datingDataMat
array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       ..., 
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])
>>> datingLabels[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
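
To try the parsing step without the data file, here is a minimal Python 3 sketch of the same idea. The function name parse_dating_lines and the in-memory sample are my own assumptions for illustration, not code from the book; the three sample rows reuse the values printed above.

```python
import io
import numpy as np

def parse_dating_lines(lines):
    """Parse tab-separated rows into a feature matrix and a label list."""
    features, labels = [], []
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) < 4:          # skip blank or malformed rows
            continue
        features.append([float(x) for x in fields[:3]])  # first three columns
        labels.append(int(fields[-1]))                   # last column is the class
    return np.array(features), labels

# Three rows in the same layout as datingTestSet2.txt (hypothetical sample).
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)
mat, labels = parse_dating_lines(sample)
print(mat.shape)
print(labels)
```

Building a list first and converting it at the end avoids reading the file twice just to count the lines, which is what the preallocated zeros matrix requires.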

2.2.2 Creating scatter plots with Matplotlib

Keep writing in kNN.py, and remember to import Matplotlib.

Scatter plot code

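import matplotlib.pyplot as plt

datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# x = column 1, y = column 2; marker size and color both scale with the class label
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.show()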
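
The third and fourth positional arguments of scatter are the marker size and the color, which is easy to miss. A small Python 3 sketch with keyword arguments makes that explicit. The data here is random and only stands in for datingDataMat, and the axis labels are my own reading of the columns, so treat them as assumptions.

```python
import matplotlib
matplotlib.use("Agg")        # render off-screen; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((30, 3))               # synthetic stand-in for datingDataMat
labels = rng.integers(1, 4, size=30)     # synthetic class labels 1, 2 or 3

fig, ax = plt.subplots()
# s sets the marker area, c sets the color; both follow the class label.
points = ax.scatter(data[:, 1], data[:, 2], s=15.0 * labels, c=labels)
ax.set_xlabel("column 1: time spent playing video games")
ax.set_ylabel("column 2: ice cream consumed")
fig.canvas.draw()            # force a render; use plt.show() in an interactive session
```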


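2.2.3 Normalizing numeric values

newValue = (oldValue-min)/(max-min)

Here min and max are the smallest and largest values of that feature (column), so every value is mapped into the range 0 to 1 and no single feature dominates the distance calculation.

Normalization code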

def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet)) # result matrix, same shape as the data matrix, filled with zeros
    m = dataSet.shape[0]  # number of rows
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))   # element-wise divide
    return normDataSet, ranges, minVals

>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ..., 
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001156])
>>>
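
NumPy broadcasting can do the same job without tile. The following Python 3 sketch is my own variant, not the book's code; min_max_normalize is a hypothetical name, and the sketch assumes no column is constant (a zero range would cause a division by zero).

```python
import numpy as np

def min_max_normalize(data):
    """Scale every column to [0, 1]; return the ranges and minimums as well."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # (m, 3) minus (3,) broadcasts row by row, so tile() is not needed.
    return (data - min_vals) / ranges, ranges, min_vals

data = np.array([[40920.0, 8.326976, 0.953952],
                 [14488.0, 7.153469, 1.673904],
                 [26052.0, 1.441871, 0.805124]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm.min(axis=0))   # each column's smallest value maps to 0
print(norm.max(axis=0))   # each column's largest value maps to 1
```

Returning ranges and min_vals matters because new inputs have to be scaled with the training values, not with their own.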

2.2.4 Testing the algorithm: using the error rate to evaluate the classifier

Classifier performance test code

def datingClassTest():
    hoRatio = 0.10      # hold out the first 10% of the data for testing
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify0 is the kNN classifier from section 2.1; the remaining 90% is the training set
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        # print(...) with one formatted string works on both Python 2 and Python 3
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
    print(errorCount)
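
The hold-out idea can be tried end to end on made-up data. In the Python 3 sketch below, knn_classify is my own stand-in for classify0 (which comes from section 2.1 and is not shown in this post), holdout_error_rate is a hypothetical helper name, and the two synthetic clusters are not the dating data.

```python
from collections import Counter
import numpy as np

def knn_classify(query, train_x, train_y, k):
    """Majority vote among the k training rows closest to query (Euclidean)."""
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def holdout_error_rate(x, y, ho_ratio=0.10, k=3):
    """Test on the first ho_ratio of the rows, train on the rest."""
    n_test = int(x.shape[0] * ho_ratio)
    errors = sum(
        knn_classify(x[i], x[n_test:], y[n_test:], k) != y[i]
        for i in range(n_test)
    )
    return errors / n_test

# Two well-separated synthetic clusters with labels 1 and 2.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(1.0, 0.1, (50, 2))])
y = [1] * 50 + [2] * 50
order = rng.permutation(100)     # shuffle so the test slice contains both classes
x, y = x[order], [y[i] for i in order]
print(holdout_error_rate(x, y))
```

Taking the first 10% as the test set only gives a fair estimate when the rows are not sorted by class, which is why the sketch shuffles its data first.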


2.2.5 Putting everything together into a complete, usable system

Prediction code

def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    # raw_input is Python 2; on Python 3 use input() instead
    percentTats = float(raw_input("percentage of time spent playing video games ?"))
    ffMile = float(raw_input("frequent flier miles earned per year ?"))
    iceCream = float(raw_input("liters of ice cream consumed per year ?"))
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    inArr = array([ffMile,percentTats,iceCream])   # same column order as the data file
    # scale the input with the training minVals and ranges before classifying
    classifierResult = classify0((inArr - minVals)/ranges, normMat,datingLabels,3)
    print("You will probably like this person: %s" % resultList[classifierResult-1])

>>> reload(kNN)
<module 'kNN' from 'kNN.py'>
>>> kNN.classifyPerson()
percentage of time spent playing video games ? 10
frequent flier miles earned per year ?10000
liters of ice cream consumed per year ?0.5
You will probably like this person: in small doses
>>>
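
The step that is easiest to get wrong here is scaling the new input. A short Python 3 sketch of just that step follows; normalize_query is a hypothetical helper of mine, and the ranges and minimums are the values printed in section 2.2.3.

```python
import numpy as np

def normalize_query(query, min_vals, ranges):
    """Scale a new sample with the training minimums and ranges."""
    return (np.asarray(query, dtype=float) - min_vals) / ranges

# ranges and minVals as printed in section 2.2.3
ranges = np.array([9.12730000e+04, 2.09193490e+01, 1.69436100e+00])
min_vals = np.array([0.0, 0.0, 0.001156])

# Column order must match the training data: miles, game-time percentage, ice cream.
q = normalize_query([10000, 10, 0.5], min_vals, ranges)
print(q)
```

If the order of the three inputs were swapped, the miles value would be divided by the wrong range and the distances to the training samples would be meaningless.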
