Data Preprocessing: Data Sampling

阿新 • Published: 2018-12-08

Data Sampling

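In the modeling stage, the samples are usually split into three parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to choose the network structure or the parameters that control model complexity, and the test set checks how well the finally selected model performs. A typical split is 70%, 15%, and 15%. When the data set is small, hold out a small part as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K folds, and in each round train on K-1 folds and validate on the remaining one, computing the sum of squared prediction errors. Finally, average the K sums of squared errors.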
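
As a rough illustration of the 70%/15%/15% split described above, the sketch below shuffles the row indices in base R and cuts them into three parts. The data set (the built-in iris), the seed, and the variable names are assumptions made for this example only.

```r
# Minimal sketch: 70/15/15 train/validation/test split (assumed example data: iris)
set.seed(42)                      # assumed seed, only for reproducibility
n   <- nrow(iris)
idx <- sample(n)                  # random permutation of the row indices

n_train <- floor(0.70 * n)
n_valid <- floor(0.15 * n)

train_set <- iris[idx[1:n_train], ]
valid_set <- iris[idx[(n_train + 1):(n_train + n_valid)], ]
test_set  <- iris[idx[(n_train + n_valid + 1):n], ]   # all remaining rows

# Sizes of the three parts
c(train = nrow(train_set), valid = nrow(valid_set), test = nrow(test_set))
```

Because the three parts are cut from a single permutation, no row can land in more than one of them.
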
1. Handling class imbalance with SMOTE

hyper <-read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data',
                 header=F)
names <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names', 
                  header=F, sep='\t')[[1]]
head(names)
names <- gsub(pattern = ":|[.]","",names) # remove colons and periods
colnames(hyper)<-names  # rename the columns
colnames(hyper)[1] <- "target"
hyper$target <- ifelse(hyper$target=="negative",0,1)  # recode the class labels as 0/1
table(hyper$target)    # class counts
prop.table(table(hyper$target))  # class proportions; class 1 is clearly the minority
hyper$target <- as.factor(hyper$target)
library("DMwR")
hyper_new <- SMOTE(target ~ .,hyper,perc.over = 100,perc.under = 200)
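# perc.over sets how many extra minority cases to generate, as a percentage of the a original minority cases; 100 doubles the minority class (a + a*100% = 2a, i.e. a new cases)
# perc.under sets how many majority cases to sample, as a percentage of the newly generated minority cases (a*100%); 200 samples twice that number (a*100%*200% = 2a)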
table(hyper_new$target)
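
The arithmetic in the two comments above can be checked with a small helper. This is only a sketch: smote_counts() is a hypothetical function written for this example (it is not part of DMwR), the minority count of 100 is made up, and the formula assumes perc.over is at least 100.

```r
# Sketch of the class-size arithmetic implied by perc.over / perc.under.
# smote_counts() is a hypothetical helper, not a DMwR function.
# Assumes perc.over >= 100.
smote_counts <- function(n_minority, perc.over, perc.under) {
  n_synthetic <- n_minority * perc.over / 100      # newly generated minority cases
  n_majority  <- n_synthetic * perc.under / 100    # majority cases sampled
  c(minority = n_minority + n_synthetic, majority = n_majority)
}

# With a = 100 minority cases: 100 synthetic cases are added (200 in total)
# and 200 majority cases are sampled, so the result is balanced.
smote_counts(n_minority = 100, perc.over = 100, perc.under = 200)
```

With perc.over = 100 and perc.under = 200 the two classes end up the same size, which is why this pair of values is a common starting point.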

2. Random sampling with sample

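# Simple random sample of 10,000 rows
# (the column 是否付費 flags whether a user pays, coded 0/1)
index <- sample(nrow(user),10000)
user_sample <- user[index,]
prop.table(table(user$是否付費))         # class proportions in the full data
prop.table(table(user_sample$是否付費))  # class proportions in the random sample
# Stratified sample: draw from each class in proportion to its share
ratio <- sum(user$是否付費==0)/nrow(user)
d <- 1:nrow(user)
n0 <- round(10000*ratio)                 # number of rows to draw from class 0
index1 <- sample(d[user$是否付費==0],n0)
index2 <- sample(d[user$是否付費==1],10000-n0)
user_sam1 <- user[c(index1,index2),]
prop.table(table(user_sam1$是否付費))  # matches the original proportions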
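
The per-class sampling above can be generalized to any number of classes. The following is a minimal base-R sketch under stated assumptions: stratified_sample() is a hypothetical helper, and the toy data frame stands in for the user table, which is not available here.

```r
# Hypothetical helper: draw about `size` rows while keeping class proportions.
stratified_sample <- function(df, group_col, size) {
  frac <- size / nrow(df)                                   # overall sampling fraction
  rows_by_class <- split(seq_len(nrow(df)), df[[group_col]])
  picked <- lapply(rows_by_class, function(rows) {
    # index via sample.int() so a class with a single row is handled correctly
    rows[sample.int(length(rows), round(length(rows) * frac))]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data (assumed): 1000 users, roughly 10% paying
set.seed(1)
toy <- data.frame(id = 1:1000, paid = rbinom(1000, 1, 0.1))
toy_sample <- stratified_sample(toy, "paid", 200)

prop.table(table(toy$paid))          # proportions in the full data
prop.table(table(toy_sample$paid))   # proportions in the stratified sample
```

Because each class is rounded separately, the returned sample can differ from the requested size by a row or two.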

3. Proportional (stratified) sampling with createDataPartition

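library("caret")
index <- createDataPartition(iris$Species,p=0.1,list = F)  # 10% of the rows, stratified by Species
iris_sample <- iris[index,]             # the sampled rows
prop.table(table(iris$Species))         # class proportions in the full data
prop.table(table(iris_sample$Species))  # class proportions in the sample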
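
createDataPartition is also commonly used to make a stratified train/test split. The sketch below assumes the caret package is installed; the 70% training share and the seed are arbitrary choices for illustration.

```r
library(caret)

set.seed(123)                                   # assumed seed
train_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[train_idx, ]                  # about 70% of each class
test_set  <- iris[-train_idx, ]                 # the remaining rows

# Class proportions should be (nearly) identical in both parts
prop.table(table(train_set$Species))
prop.table(table(test_set$Species))
```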

4. Sampling for cross-validation

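# Build 6 folds
n <- nrow(user)
d <- 1:nrow(user)
d2 <- rep(1:6,ceiling(n/6))   # fold labels 1..6, repeated to cover all rows
d2 <- sample(d2,n)            # randomly assign one label to each row
for(i in 1:6){
  m<-d[d2==i]                 # rows of fold i
  train1 <- user[-m,]         # train on the other 5 folds
  test1 <- user[m,]           # validate on fold i
  # fit and evaluate the model for fold i here
}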
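
To show how such a loop produces the averaged sum of squared errors, here is a minimal base-R sketch. The data set (the built-in mtcars), the linear model mpg ~ wt + hp, and the seed are assumptions chosen only so that the example runs on its own.

```r
# Sketch: 6-fold cross-validation of a linear model on mtcars (assumed example)
set.seed(2018)
k <- 6
n <- nrow(mtcars)
fold_id <- sample(rep(1:k, length.out = n))   # balanced random fold labels

sse <- numeric(k)
for (i in 1:k) {
  held_out <- which(fold_id == i)                        # rows of fold i
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-held_out, ])  # train on the other folds
  pred <- predict(fit, newdata = mtcars[held_out, ])     # predict the held-out fold
  sse[i] <- sum((mtcars$mpg[held_out] - pred)^2)         # sum of squared errors
}
mean(sse)   # average SSE over the k folds
```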

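Alternatively, caret's createFolds(y, k = 10, list = TRUE, returnTrain = FALSE) builds the folds while keeping the class proportions, and createMultiFolds(y, k = 10, times = 5) repeats the k-fold split several times.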
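
A short usage sketch of these two functions, assuming caret is installed; iris and the seed are stand-ins for illustration.

```r
library(caret)

set.seed(7)                                   # assumed seed
folds <- createFolds(iris$Species, k = 10, list = TRUE, returnTrain = FALSE)
length(folds)                                 # number of folds
table(iris$Species[folds[[1]]])               # class counts in the first held-out fold

# Repeated k-fold: each element holds the TRAINING indices of one resample
multi <- createMultiFolds(iris$Species, k = 10, times = 5)
length(multi)                                 # k * times resamples
```

With returnTrain = FALSE each list element contains the held-out rows, so the model for fold i is trained on iris[-folds[[i]], ].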

Data Cleaning

1. Missing values

palyer <- read.csv("玩家玩牌資料.csv",na.strings = "NA")  # player card-game data
table(complete.cases(palyer))   # count complete vs. incomplete rows
library("mice")
md.pattern(palyer)              # tabulate the missing-data patterns

[Figure: missing-data pattern from md.pattern(palyer)]

library("VIM")
aggr(palyer,prop = FALSE,numbers = TRUE)  # plot missing-value counts per variable and per combination
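
The same information can be inspected in plain text with base R. The toy data frame below is an assumption that stands in for the player table, and its column names are made up.

```r
# Toy data with missing values (assumed, stands in for the player table)
toy <- data.frame(
  games = c(10, NA, 7, 3, NA),
  chips = c(500, 200, NA, 150, 80),
  level = c(1, 2, 2, 3, 1)
)

colSums(is.na(toy))          # missing values per column: games 2, chips 1, level 0
mean(complete.cases(toy))    # share of fully observed rows: 0.4
toy[!complete.cases(toy), ]  # rows that contain at least one NA
```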

[Figure: missing-value plot from aggr(palyer)]

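player1 <- na.omit(palyer)                # option 1: drop rows with any missing value
index1 <- is.na(palyer$玩牌局數)          # option 2: impute; flag rows missing 玩牌局數 (games played)
a <- palyer[,"玩牌局數"]                  # copy of the column before imputation
palyer[index1,"玩牌局數"] <- ceiling(mean(palyer$玩牌局數,na.rm=T))  # fill with the rounded-up mean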
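
The mean-imputation step can be wrapped in a small reusable function. This is a sketch only: impute_mean() is a hypothetical helper, and the vector of values is made up.

```r
# Hypothetical helper: replace NAs in a numeric vector with its mean,
# optionally rounded up (useful for count-like columns).
impute_mean <- function(x, round_up = FALSE) {
  fill <- mean(x, na.rm = TRUE)        # mean of the observed values
  if (round_up) fill <- ceiling(fill)
  x[is.na(x)] <- fill
  x
}

games <- c(10, NA, 7, 3, NA)           # assumed toy values
impute_mean(games, round_up = TRUE)    # NAs become ceiling(mean(c(10, 7, 3))) = 7
```

Returning a new vector instead of overwriting the column in place keeps the original values available for comparison.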
