R語言與迴歸分析學習筆記（bootstrap method）

阿新 • • 發佈：2019-01-09

Bootstrap方法在之前的博文《R語言與點估計學習筆記（EM演算法與Bootstrap法）》裡有提到過，簡而言之，bootstrap方法就是重抽樣。為什麼需要bootstrap方法呢？因為bootstrap方法使得我們無需分佈理論的知識也可以進行假設檢驗，獲得置信區間。當資料來自未知分佈，或者存在嚴重異常點，又或者樣本量過小，沒有引數方法解決問題時，bootstrap方法將是一個很棒的方法。

對於迴歸分析而言，bootstrap無疑對迴歸的正態性假設做了極大地放鬆，使得迴歸推斷越來越好用，也更具有說服力。

從博文《R語言與點估計學習筆記（EM演算法與Bootstrap法）》

裡可以看到，對於引數統計，特別是在已知分佈的引數估計，bootstrap並沒有多大的意義，它的結果和矩估計或者極大似然估計的結果並沒有多大的差別（如果有差別會令人覺得很奇怪，不是嗎？）

Boot包中提供了做bootstrap的兩個十分好用的函式：boot（），boot.ci（）。兩者的呼叫格式與引數說明如下：

Boot（）函式：

boot(data, statistic, R, sim ="ordinary", stype = c("i", "f", "w"),

strata= rep(1,n), L = NULL, m = 0, weights = NULL,

ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,

parallel = c("no", "multicore", "snow"),

ncpus = getOption("boot.ncpus", 1L), cl = NULL)

引數說明：

Data：資料，可以是向量，矩陣，資料框

Statistic：統計量，如均值，中位數，迴歸引數，迴歸裡的R^2等

R：呼叫統計量函式次數

Boot（）的返回值：

T0:從原始資料中得到的k個統計量的觀測值

T:一個R*K的矩陣

Boot.ci（）函式：

boot.ci(boot.out, conf = 0.95, type = "all",

        index = 1:min(2,length(boot.out$t0)), var.t0 = NULL,

        var.t = NULL, t0 = NULL, t = NULL, L = NULL,

        h = function(t) t, hdot = function(t) rep(1,length(t)),

        hinv = function(t) t, ...)

引數說明：

Boot.out（）：boot函式的返回值

Type：返回置信區間的型別，R中提供的有"norm" ,"basic", "stud","perc", "bca"

一、對單個統計量使用bootstrap方法
我們以R中的資料集women為例說明這個問題。資料集women列出了美國婦女的平均身高和體重。以體重為響應變數，身高為協變數進行迴歸，獲取斜率項的95%置信區間。

R可以通過以下程式碼告訴我們答案：

library(boot)
beta<-function(formula,data,indices){
d<-data[indices,]
fit<-lm(formula,data=d)
return(fit$coef[2])
}
result<-boot(data=women,statistic=beta,R=500,formula=weight~height)
boot.ci(result)

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= result)

Intervals:

Level Normal Basic

95% ( 3.218, 3.686 ) ( 3.231, 3.704 )

Level Percentile BCa

95% ( 3.196, 3.669 ) ( 3.199, 3.675 )

Calculationsand Intervals on Original Scale

他們與傳統的估計差別大嗎？我們來看看傳統的區間估計：

confint(lm(weight~height,data=women))

輸出結果：

2.5 % 97.5 %

(Intercept) -100.342655 -74.690679

height 3.253112 3.646888

可以看出，差別並不是很大，究其原因，無外乎正態性得到了很好的滿足。我們看其qq圖：

很清楚也很顯然。Shapiro檢驗也說明了這樣一個事實：

Shapiro-Wilk normality test

data: women$weight

W = 0.9604,p-value = 0.6986

我們在來看一個差別較大的例子：

. 考慮R中的資料集faithful。以waiting為響應變數，eruptions為協變數，建立簡單迴歸模型y=α+βx+e。考慮β的95%置信區間。

重複上面的步驟，R程式碼如下：

result<-boot(data=faithful,statistic=beta,R=500,formula=waiting~eruptions)
boot.ci(result)
confint(lm(waiting~eruptions,data=faithful))
qqPlot(lm(waiting~eruptions,data=faithful))

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Based on 500bootstrap replicates

CALL :

boot.ci(boot.out= result)

Intervals :

Level Normal Basic

95% (10.08, 11.30 ) (10.06, 11.26 )

Level Percentile BCa

95% (10.20, 11.40 ) (10.13, 11.35 )

Calculationsand Intervals on Original Scale

Some BCaintervals may be unstable

傳統估計：

2.5 % 97.5 %

(Intercept)31.20069 35.74810

eruptions 10.10996 11.34932

差別有些大，我們來看看qq圖：

正態性不是很好，shapiro檢驗告訴我們幾乎不可能認為是正態的。

Shapiro-Wilk normality test

data: faithful$waiting

W = 0.9221,p-value = 1.016e-10

觀察下圖，我們也可以看到waiting不服從正態分佈，那麼他的95%的置信區間可以通過以下程式碼獲得：

mean1<-function(data,indices){
d<-data[indices,]
fit<-mean(d$waiting)
return(fit)
}
results<-boot(data=faithful,statistic=mean1,R=1000)
boot.ci(results)

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Based on 1000bootstrap replicates

CALL :

boot.ci(boot.out= results)

Intervals :

Level Normal Basic

95% (69.28, 72.50 ) (69.31, 72.53 )

Level Percentile BCa

95% (69.26, 72.49 ) (69.10, 72.44 )

Calculationsand Intervals on Original Scale

與傳統方法比較

t.test(faithful$waiting)$conf.int

[1] 69.2741872.51994

可見估計的穩健性。

二、對多個統計量使用bootstrap方法

首先，建立一個返回迴歸係數向量的函式：

betas<-function(formula,data,indices){
d<-data[indices,]
fit<-lm(formula,data=d)
return(fit$coef)
}

然後自助抽樣500次：

results<-boot(data=states,statistic=betas,R=500,formula=Murder~.)
print(results)

輸出結果：

ORDINARYNONPARAMETRIC BOOTSTRAP

Call:

boot(data= states, statistic = betas, R = 500, formula = Murder ~ .)

BootstrapStatistics :

original bias std. error

t1* 1.2345634112 8.969206e-01 5.305460e+00

t2* 0.0002236754 2.345369e-06 8.884204e-05

t3* 4.1428365903 -1.621533e-01 8.333313e-01

t4* 0.0000644247 -1.131896e-04 9.750587e-04

t5* 0.0005813055 -1.928938e-03 1.055004e-02

當對多個統計量自助抽樣時，需要新增一個索引引數，指明plot（），boot.ci()函式所分析物件。如下列程式碼用於繪製人口結果：

plot(results,index=2)

輸出結果：

95%置信區間：

boot.ci(results,type="bca",index=2)
boot.ci(results,type="bca",index=3)
boot.ci(results,type="bca",index=4)

輸出結果：

>boot.ci(results,type="bca",index=2)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 2)

Intervals:

Level BCa

95% ( 0.0001, 0.0004 )

Calculationsand Intervals on Original Scale

>boot.ci(results,type="bca",index=3)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 3)

Intervals:

Level BCa

95% ( 2.284, 5.587 )

Calculationsand Intervals on Original Scale

SomeBCa intervals may be unstable

>boot.ci(results,type="bca",index=4)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 4)

Intervals:

Level BCa

95% (-0.0022, 0.0016 )

Calculationsand Intervals on Original Scale

三、殘差法。

Boot包中給出的辦法都是使用成對的bootstrap。結果較為穩健。但我們也可以使用殘差法對有固定水平的預測變數做估計。

演算法如下：

1、先由觀測資料擬合迴歸模型

2、然後獲得響應與殘差

3、從擬合殘差集合中有放回隨機抽取得到bootstrap殘差集合

4、生成一個偽資料集，對x迴歸偽資料集，獲得bootstrap引數估計beta

5、重複多次，得到beta的經驗分佈，做推斷或估計

我以women資料集為例，利用殘差法說明迴歸係數顯著不為0.R程式碼如下：

lm.reg<-lm(weight~height,data=women)
y.fit<-predict(lm.reg)
y.res<-residuals(lm.reg)
y.bootstrap<-rep(1,15)
datap<-rep(0,100)
dataq<-rep(0,100)
for(pin 1:100){
for(iin 1:15){
res<-sample(y.res,1,replace=TRUE)
y.bootstrap[i]<-y.fit[i]+res
}
datap[p]<-lm(y.bootstrap~women$height)$coef[1]
dataq[p]<-lm(y.bootstrap~women$height)$coef[2]
}
ecdf(datap)
ecdf(dataq)

輸出的結果

ecdf(datap)#x[1:100]= -104.56, -99.192, -97.884, ...,-75.389, -72.604

ecdf(dataq)#x[1:100]= 3.2271, 3.2588, 3.2845, ..., 3.625, 3.7125

顯然結論是對的，其95%的置信區間使用分位數法有：

A<-ecdf(datap)
B<-ecdf(dataq)
quantile(A,probs=c(0.025,0.975))
quantile(B,probs=c(0.025,0.975))

輸出結果：

> quantile(A,probs=c(0.025,0.975))

2.5% 97.5%

-96.95054 -75.22924

> quantile(B,probs=c(0.025,0.975))

2.5% 97.5%

3.264777 3.595554

這次介紹遺留了兩個有價值的問題：1、初始樣本多大？2、應該重複多少次？第一個問題，沒有簡單的答案，但有一點是肯定的，bootstrap不會提供比初始樣本更多的資訊，初始資訊的收集仍舊十分重要。第二個問題在計算機資源普及的今天，多做一些沒有太大的壞處，但是monte carlo方法及方差縮減技術將是你需要學習的。

R語言與迴歸分析學習筆記（bootstrap method）

R語言與迴歸分析學習筆記（bootstrap method）

R語言與點估計學習筆記（刀切法與最小二乘估計）

R語言與點估計學習筆記（EM演算法與Bootstrap法）

R語言與時間序列學習筆記（1）

R語言與點估計學習筆記（矩估計與MLE）

R語言與時間序列學習筆記（2）

c&c++反彙編與逆向分析學習筆記（2）--反彙編靜態分析工具IDA

二叉搜尋樹的原始碼分析學習筆記（基礎版）

應用統計學與R語言實現學習筆記（十一）——判別分析

R語言與機器學習學習筆記（分類演算法）（6）logistic迴歸

R語言學習筆記（十七）：data.table包中melt與dcast函數的使用

R語言與機器學習學習筆記（分類演算法）（1）K-近鄰演算法

R語言與機器學習學習筆記（分類演算法）（3）樸素貝葉斯

R語言與機器學習學習筆記（分類演算法）（2）決策樹演算法

基於R語言的Kaggle案例分析學習筆記（五）

應用統計學與R語言實現學習筆記（十四）——案例與實踐

R語言學習筆記（十一）：廣義線性模型

R語言學習筆記（十六）：處理缺失值

Python學習筆記（十二）：lambda表示式與函數語言程式設計

實變函式與泛函數分析學習筆記（二）：賦範線性空間

R語言與迴歸分析學習筆記（bootstrap method）

相關推薦