
Stanford Machine Learning Course Notes 1 - Linear Regression and Logistic Regression

Please credit the source when reposting: http://xiahouzuoxin.github.io/notes/

Course Plan

The lecturer, Andrew Ng, is one of the big names in machine learning: he founded Coursera, the largest open-course platform, and was recently reported to have joined Baidu. His machine learning course is arguably a must-watch for anyone studying computer science. The outline of the whole course is roughly as follows:

  1. Introduction (1 class) Basic concepts.
  2. Supervised learning. (6 classes) Supervised learning setup. LMS. Logistic regression. Perceptron. Exponential family. Generative learning algorithms. Gaussian discriminant analysis.
    Naive Bayes. Support vector machines. Model selection and feature selection. Ensemble methods: Bagging, boosting, ECOC.
  3. Learning theory. (3 classes) Bias/variance tradeoff. Union and Chernoff/Hoeffding bounds. VC dimension. Worst case (online) learning. Advice on using learning algorithms.
  4. Unsupervised learning. (5 classes) Clustering. K-means. EM. Mixture of Gaussians. Factor analysis. PCA. MDS. pPCA. Independent components analysis (ICA).
  5. Reinforcement learning and control. (4 classes) MDPs. Bellman equations. Value iteration. Policy iteration. Linear quadratic regulation (LQR). LQG. Q-learning. Value function approximation. Policy search. Reinforce. POMDPs.

These notes record my study and hands-on practice of the Linear Regression and Logistic Regression parts.

Linear Regression and Prediction Problems

The course starts with a house price prediction example:

Area(feet^2)   #bedrooms   Price(1000$)
2014           3           400
1600           3           330
2400           3           369
1416           2           232
3000           4           540
3670           4           620
4500           5           800

Assume the house price is a linear function of the area and the number of bedrooms, and use this linear relationship to predict prices. This gives the linear model hθ(x) = θTx, where x = [x1, x2] are the area and the number of bedrooms. To obtain the prediction model, the parameters θ must be fitted from the data in the table. The course explains from a probabilistic point of view (the key assumption is that the errors of the linear model follow a Gaussian distribution; maximizing the likelihood and setting the derivative to zero then leads directly to the expression below) why the parameters should be obtained by solving the following least-squares problem:
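Written out in the standard form from the lecture notes, with m training examples (x^(i), y^(i)), the least-squares expression is:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2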

The J(θ) above is called the cost function; the parameters of the fitted model are obtained by solving min J(θ).

There are several ways to solve min J(θ), including the gradient descent algorithm and Newton's method. Both are numerical optimization methods that are well suited to computer implementation, and both apply not only to this linear regression model but also to nonlinear models such as the Logistic model below. In addition, Andrew Ng used linear algebra to derive the closed-form solution of the least-squares problem:
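In matrix notation (each row of the design matrix X is one training example with a leading 1, and y stacks the targets), this closed form is the normal equation:

\theta = (X^T X)^{-1} X^T y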

The gradient descent algorithm comes in two flavors: batch gradient descent and stochastic gradient descent. Batch gradient descent uses all of the training samples for every update of θ, while stochastic gradient descent uses only a single sample per update. Their update procedures are as follows, with the two rules sketched right after the list:

  1. batch gradient descent

  2. stochastic gradient descent

    for i=1 to m
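A sketch of the two update rules in the notation of the course, where α is the learning rate and m the number of samples:

batch gradient descent (all m samples per update), for every j:
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

stochastic gradient descent (one sample per update), for i = 1 to m and every j:
\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}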

The only difference is where the samples enter: batch gradient descent sums over all samples inside each update, while stochastic gradient descent puts the samples in a for loop and updates once per sample. Experience shows that with a suitable learning rate α, the gradient descent algorithm converges to a value close to the optimum. If the learning rate is too large the cost function may diverge; if it is too small, convergence becomes slow.

Andrew Ng did not say much about convergence criteria in the course; here are two that I use:

  1. Stop when J drops below a given value
  2. Stop when the difference of J between two iterations, i.e. |J-J_pre|, is below a given value

Below is an example of fitting the house price data above with batch gradient descent:

clear all;
clc

%% Original data
x = [2014, 1600, 2400, 1416, 3000, 3670, 4500;...
    3,3,3,2,4,4,5;];
y = [400 330 369 232 540 620 800];

error = Inf;
threshold = 4300;
alpha = 10^(-10);
x = [zeros(1,size(x,2)); x];  % x0 = 0: the intercept feature is set to 0, so the constant term is effectively not fitted
theta = [0;0;0]; % constant term stays 0
J = 1/2*sum((y-theta'*x).^2);
costs = [];
while error > threshold
    tmp = y-theta'*x;
    theta(1) = theta(1) + alpha*sum(tmp.*x(1,:));
    theta(2) = theta(2) + alpha*sum(tmp.*x(2,:));
    theta(3) = theta(3) + alpha*sum(tmp.*x(3,:));
%     J_last = J;
    J = 1/2*sum((y-theta'*x).^2);
%     error = abs(J-J_last);
    error = J;
    costs =[costs, error];
end

%% Plot
figure,
subplot(211);
plot3(x(2,:),x(3,:),y, '*');
grid on;
xlabel('Area');
ylabel('#bedrooms');
zlabel('Price(1000$)');

hold on;
H = theta'*x;
plot3(x(2,:),x(3,:),H,'r*');

hold on
hx(1,:) = zeros(1,20);
hx(2,:) = 1000:200:4800;
hx(3,:) = 1:20;
[X,Y] = meshgrid(hx(2,:), hx(3,:));
H = theta(2:3)'*[X(:)'; Y(:)'];
H = reshape(H,[20,20]);
mesh(X,Y,H);

legend('Original data', 'Fit of original data', 'Fitted plane');

subplot(212);
plot(costs, '.-');
grid on
title('Iterative convergence of error=J(\theta)');
xlabel('Iterations');
ylabel('J(\theta)');

The fitting result and convergence process are shown below:

Whether it is gradient descent or the linear regression model, these are just tools!! Analyzing the result is what matters: from the fitted plane above we can see that the main factor affecting the house price is the area, not the number of bedrooms.

In many cases the model is not linear, for example the rise and fall of a stock price over time. In such cases the assumption hθ(x) = θTx no longer holds, and there are two options:

  1. Build a nonlinear model, such as an exponential model or any other model that matches how the data vary
  2. Build local linear models, fitting a linear model to each segment of the data. In class Andrew Ng presented Locally Weighted Linear Regression, i.e. a locally weighted linear model

Locally Weighted Linear Regression
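In Locally Weighted Linear Regression, for each query point x the parameters are refit by a weighted least-squares criterion (standard formulation from the lecture notes):

\min_\theta \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2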

where a good choice of the weights is:
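A commonly used form is the Gaussian kernel centered at the query point x, with the bandwidth τ controlling how quickly the weights fall off with distance from x:

w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)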

Logistic Regression and Classification Problems

Linear Regression solves continuous prediction and fitting problems, while Logistic Regression solves discrete classification problems. They look like two different approaches, but in essence they lead to the same place: both can be viewed as special cases of the exponential family.

In a classification problem, y takes values in {0, 1}, so the Linear Regression model above obviously does not fit. Modify the model as follows:
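Following the standard form, the hypothesis becomes the linear model composed with the sigmoid function g:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}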

This model is called the Logistic function or sigmoid function. To see why this function is chosen, just look at its graph:

The sigmoid function takes values in (0, 1); the parameter θ controls where the transition occurs and how steep it is. Taking 0.5 as the cut-off, y is assigned 1 when hθ(x) > 0.5 and 0 when hθ(x) < 0.5, which gives a two-class classifier.

Assume P(y = 1|x; θ) = hθ(x) and P(y = 0|x; θ) = 1 − hθ(x), or written more compactly,
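both cases collapse into a single Bernoulli expression:

p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1-y}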

For the m training samples, maximizing the likelihood function gives
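the likelihood and its logarithm (the form actually maximized):

L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta), \qquad \ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]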

The same gradient method can be used to solve this maximization problem; turning the maximization into a minimization (or simply ascending the gradient), the iteration formula has exactly the same form,
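namely, for a single sample (x^(i), y^(i)) and every j:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

which looks identical to the LMS update of linear regression, except that hθ is now the sigmoid of θTx.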

So the final gradient update is the same as in Linear Regression. I made an example (dataset link); the MATLAB code for Logistic regression is below:

function Logistic

clear all;
close all
clc

data = load('LogisticInput.txt');
x = data(:,1:2);
y = data(:,3);

% Plot Original Data
figure,
positive = find(y==1);
negtive = find(y==0);
hold on
plot(x(positive,1), x(positive,2), 'k+', 'LineWidth',2, 'MarkerSize', 7);
plot(x(negtive,1), x(negtive,2), 'bo', 'LineWidth',2, 'MarkerSize', 7);

% Compute Likelihood(Cost) Function
[m,n] = size(x);
x = [ones(m,1) x];
theta = zeros(n+1, 1);
[cost, grad] = cost_func(theta, x, y);
threshold = 0.1;
alpha = 10^(-1);
costs = [];
while cost > threshold
    theta = theta + alpha * grad;
    [cost, grad] = cost_func(theta, x, y);
    costs = [costs cost];
end

% Plot Decision Boundary 
hold on
plot_x = [min(x(:,2))-2,max(x(:,2))+2];
plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
plot(plot_x, plot_y, 'r-');
legend('Positive', 'Negative', 'Decision Boundary')
xlabel('Feature Dim1');
ylabel('Feature Dim2');
title('Classification Using Logistic Regression');

% Plot Costs Iteration
figure,
plot(costs, '*');
title('Cost Function Iteration');
xlabel('Iterations');
ylabel('Cost Function Value');

end

function g=sigmoid(z)
g = 1.0 ./ (1.0+exp(-z));
end

function [J,grad] = cost_func(theta, X, y)
% Compute Likelihood (Cost) Function and Gradient
m = length(y); % training examples
hx = sigmoid(X*theta);
J = (1./m)*sum(-y.*log(hx)-(1.0-y).*log(1.0-hx));
grad = (1./m) .* X' * (y-hx);
end

The decision boundary is computed by setting h(x) = 0.5. When a new input arrives, compute h(x): if h(x) > 0.5 the sample belongs to the positive class, and if h(x) < 0.5 it belongs to the negative class.
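Concretely, hθ(x) = 0.5 is equivalent to θTx = 0, so with two features the boundary is the straight line

\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \quad\Longleftrightarrow\quad x_2 = -\frac{\theta_1 x_1 + \theta_0}{\theta_2}

which is what plot_y computes in the code above (MATLAB's theta(1), theta(2), theta(3) correspond to θ0, θ1, θ2).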

Feature Mapping and Over-fitting

This part was not covered in Andrew Ng's lecture; I referred to material found online.

The data above can be separated by a straight line, but in practice there are cases where a straight-line decision boundary simply does not work (see the example later).

To handle this, the features must be mapped into a higher dimension, and the data are then separated by a non-straight decision boundary. Feature mapping builds more features by taking polynomial combinations of the existing ones,
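For example, mapping the two features up to 2nd-order terms gives

(x_1, x_2) \mapsto \left[ 1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2 \right]

(the mapFeature function in the code below does the same thing up to degree 6).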

The above maps the two-dimensional features up to 2nd order (they can be mapped to even higher orders), which makes it easy to form a nonlinear decision boundary.

A problem remains, however: although the mapping above makes it easy to separate nonlinear data, the high-dimensional features also make over-fitting likely. A regularization term is therefore introduced to counter over-fitting. With the regularization term added, the cost (negative log-likelihood) function becomes
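the form that matches cost_func in the code below (the intercept θ0 is not regularized):

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2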

The gradient descent update changes accordingly,
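a sketch of the modified update, again matching the grad computed in cost_func below:

\theta_0 := \theta_0 + \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_0^{(i)}

\theta_j := \theta_j + \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} - \frac{\lambda}{m} \theta_j \right], \quad j \ge 1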

Finally, an example (sample data link); the corresponding MATLAB classification code with the regularization term and feature mapping is as follows:

function LogisticEx2

clear all;
close all
clc

data = load('ex2data2.txt');
x = data(:,1:2);
y = data(:,3);

% Plot Original Data
figure,
positive = find(y==1);
negtive = find(y==0);
subplot(1,2,1);
hold on
plot(x(positive,1), x(positive,2), 'k+', 'LineWidth',2, 'MarkerSize', 7);
plot(x(negtive,1), x(negtive,2), 'bo', 'LineWidth',2, 'MarkerSize', 7);

% Compute Likelihood(Cost) Function
[m,n] = size(x);
x = mapFeature(x);
theta = zeros(size(x,2), 1);
lambda = 1;
[cost, grad] = cost_func(theta, x, y, lambda);
threshold = 0.53;
alpha = 10^(-1);
costs = [];
while cost > threshold
    theta = theta + alpha * grad;
    [cost, grad] = cost_func(theta, x, y, lambda);
    costs = [costs cost];
end

% Plot Decision Boundary 
hold on
plotDecisionBoundary(theta, x, y);
legend('Positive', 'Negative', 'Decision Boundary')
xlabel('Feature Dim1');
ylabel('Feature Dim2');
title('Classification Using Logistic Regression');

% Plot Costs Iteration
% figure,
subplot(1,2,2);plot(costs, '*');
title('Cost Function Iteration');
xlabel('Iterations');
ylabel('Cost Function Value');

end

function f=mapFeature(x)
% Map features to high dimension
degree = 6;
f = ones(size(x(:,1)));  
for i = 1:degree  
    for j = 0:i  
        f(:, end+1) = (x(:,1).^(i-j)).*(x(:,2).^j);
    end  
end
end

function g=sigmoid(z)
g = 1.0 ./ (1.0+exp(-z));
end

function [J,grad] = cost_func(theta, X, y, lambda)
% Compute Likelihood (Cost) Function and Gradient
m = length(y); % training examples
hx = sigmoid(X*theta);
J = (1./m)*sum(-y.*log(hx)-(1.0-y).*log(1.0-hx)) + (lambda./(2*m)*norm(theta(2:end))^2);
regularize = (lambda/m).*theta;
regularize(1) = 0;
grad = (1./m) .* X' * (y-hx) - regularize;
end

function plotDecisionBoundary(theta, X, y)
%PLOTDECISIONBOUNDARY Plots the data points X and y into a new figure with
%the decision boundary defined by theta
%   PLOTDECISIONBOUNDARY(theta, X,y) plots the data points with + for the 
%   positive examples and o for the negative examples. X is assumed to be 
%   a either 
%   1) Mx3 matrix, where the first column is an all-ones column for the 
%      intercept.
%   2) MxN, N>3 matrix, where the first column is all-ones

% Plot Data
% plotData(X(:,2:3), y);
hold on

if size(X, 2) <= 3
    % Only need 2 points to define a line, so choose two endpoints
    plot_x = [min(X(:,2))-2,  max(X(:,2))+2];

    % Calculate the decision boundary line
    plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));

    % Plot, and adjust axes for better viewing
    plot(plot_x, plot_y)
    
    % Legend, specific for the exercise
    legend('Admitted', 'Not admitted', 'Decision Boundary')
    axis([30, 100, 30, 100])
else
    % Here is the grid range
    u = linspace(-1, 1.5, 50);
    v = linspace(-1, 1.5, 50);

    z = zeros(length(u), length(v));
    % Evaluate z = theta*x over the grid
    for i = 1:length(u)
        for j = 1:length(v)
            z(i,j) = mapFeature([u(i), v(j)])*theta;
        end
    end
    z = z'; % important to transpose z before calling contour

    % Plot z = 0
    % Notice you need to specify the range [0, 0]
    contour(u, v, z, [0, 0], 'LineWidth', 2)
end
end

Let us look back at the Logistic problem: for this nonlinear problem we merely used a nonlinear mapping called the sigmoid to turn it into a linear problem. So, besides the sigmoid, are there other functions that could be used? Andrew Ng also covered the exponential family of distributions.