
A Simple Tutorial: Solving a Simple Fruit Classification Problem with Python

In this post, we will implement several machine learning algorithms in Python using scikit-learn, the most popular machine learning library for Python, and use a simple dataset to train a classifier to distinguish between different types of fruit. The goal of this post is to identify the machine learning algorithm best suited to the problem at hand; to that end, we will compare several algorithms and pick the one that performs best. Let's get started!

Data

The fruit dataset was created by Dr. Iain Murray of the University of Edinburgh. He bought a few dozen oranges, lemons and apples of different varieties and recorded their measurements in a table. Professors at the University of Michigan then formatted the fruit data slightly; it can be downloaded from the link below.

Download: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt

Let's take a look at the first few rows of the data.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
fruits.head()

Figure 1

Each row of the dataset represents one piece of fruit, described by several features in the table.

There are 59 fruits and 7 features in the dataset:

print(fruits.shape)

(59, 7)

There are four types of fruit in the dataset:

print(fruits['fruit_name'].unique())

['apple' 'mandarin' 'orange' 'lemon']

The data is fairly balanced, except for the mandarins. We can move on to the next step.

print(fruits.groupby('fruit_name').size())

Figure 2

import seaborn as sns
sns.countplot(x='fruit_name', data=fruits)
plt.show()

Figure 3
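Side note: if the small mandarin class were a concern, the split into training and test data could preserve the class proportions. A minimal sketch using scikit-learn's stratify option (this is an aside; stratification is not used in the rest of this post):

from sklearn.model_selection import train_test_split
# Hypothetical alternative split that keeps the class proportions intact.
features = fruits[['mass', 'width', 'height', 'color_score']]
labels = fruits['fruit_label']
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, stratify=labels, random_state=0)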

Visualization

  • A box plot of each numeric variable gives us a clearer idea of the distribution of the input variables:
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(9,9), 
                                        title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()

Figure 4

  • It looks like color_score roughly follows a Gaussian distribution.
fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9,9))
plt.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()

Figure 5

  • Some pairs of attributes are correlated (e.g., mass and width), which suggests a strong correlation and a predictable relationship.
# Note: pandas.tools.plotting was moved to pandas.plotting in later pandas releases.
from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
cmap = cm.get_cmap('gnuplot')
scatter = scatter_matrix(X, c = y, marker = 'o', s=40, hist_kwds={'bins':15}, figsize=(9,9), cmap = cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')

Figure 6

Statistical Summary

Figure 7
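The code that produced this summary is not shown in the original post; assuming it comes from pandas' describe, a minimal sketch to reproduce it:

# Summary statistics (count, mean, std, quartiles) for the numeric features.
print(fruits[['mass', 'width', 'height', 'color_score']].describe())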

We can see that the numeric features are not on the same scale. We will need to scale the test set with the parameters computed on the training set, so that no information from the test set leaks into training.

Creating training and test sets, and applying scaling

from sklearn.model_selection import train_test_split
# Default split: 75% of the data for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the training set only, then apply it to both sets.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Building the models

Logistic Regression

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.70

Accuracy of Logistic regression classifier on test set: 0.40
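The low scores suggest the default model is under-fitting here. One simple experiment is to weaken the regularization; a hedged sketch (C=100 is an arbitrary choice, not part of the original post):

# Hypothetical: relax the L2 regularization (C=100 is arbitrary).
logreg2 = LogisticRegression(C=100).fit(X_train, y_train)
print('Test set accuracy with C=100: {:.2f}'.format(logreg2.score(X_test, y_test)))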

Decision Tree

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on training set: 1.00

Accuracy of Decision Tree classifier on test set: 0.73
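A perfect training score paired with a much lower test score is a classic sign of overfitting. Limiting the tree depth is a common remedy; a hedged sketch (max_depth=3 is an arbitrary choice, not part of the original post):

# Hypothetical: constrain the tree depth to curb overfitting (max_depth=3 is arbitrary).
clf2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print('Test set accuracy with max_depth=3: {:.2f}'.format(clf2.score(X_test, y_test)))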

K-Nearest Neighbors (K-NN)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.95

Accuracy of K-NN classifier on test set: 1.00
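As a quick sanity check, the fitted model can classify a new fruit. A minimal sketch (the measurements below are made up, and new samples must be scaled with the same fitted scaler):

# Hypothetical fruit: mass 20 g, width 4.3 cm, height 5.5 cm, color_score 0.55.
example = scaler.transform([[20, 4.3, 5.5, 0.55]])
lookup = dict(zip(fruits['fruit_label'], fruits['fruit_name']))
print(lookup[knn.predict(example)[0]])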

Linear Discriminant Analysis

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
     .format(lda.score(X_test, y_test)))

Accuracy of LDA classifier on training set: 0.86

Accuracy of LDA classifier on test set: 0.67

Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

Accuracy of GNB classifier on training set: 0.86

Accuracy of GNB classifier on test set: 0.67

Support Vector Machine

from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))

Accuracy of SVM classifier on training set: 0.61

Accuracy of SVM classifier on test set: 0.33
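The default RBF-kernel SVC clearly struggles here; its C (and gamma) parameters usually need tuning. A hedged sketch (C=10 is an arbitrary choice, not part of the original post):

# Hypothetical: a less regularized SVM (C=10 is arbitrary).
svm2 = SVC(C=10).fit(X_train, y_train)
print('Test set accuracy with C=10: {:.2f}'.format(svm2.score(X_test, y_test)))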

KNN was the most accurate of the models we tried. The confusion matrix indicates that it made no errors on the test set; however, the test set is very small.

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

Figure 8

Plotting the decision boundary of the k-NN classifier

import numpy as np
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn import neighbors
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def plot_fruit_knn(X, y, n_neighbors, weights):
    # Use only two features (height, width) so the boundary can be drawn in 2D.
    X_mat = X[['height', 'width']].values
    y_mat = y.values
    # Create color maps: light shades for regions, bold shades for points.
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold  = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot the training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y_mat,
                cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('height (cm)')
    plt.ylabel('width (cm)')
    plt.title("4-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.show()
plot_fruit_knn(X_train, y_train, 5, 'uniform')

Figure 9

k_range = range(1, 20)
scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20])

Figure 10

For this particular dataset, we obtain the highest accuracy when k = 5.
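Since k was chosen by evaluating on the very small test set, a cross-validated search over k on the training set would be more trustworthy. A minimal sketch assuming scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV
# Search k = 1..19 with 5-fold cross-validation, using the training set only.
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 20))}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, '{:.2f}'.format(grid.best_score_))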

Conclusion

In this post, we focused on prediction accuracy. Our goal was to learn a model with good generalization performance, i.e., one that maximizes prediction accuracy. By comparing several different algorithms, we identified the machine learning algorithm best suited to the problem at hand (classifying fruit types).

Source code: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Solving%20A%20Simple%20Classification%20Problem%20with%20Python.ipynb