
[One Week of Algorithm Practice] 2. Model Building: Ensemble Models

Model Building: Ensemble Models

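Build four models (RF, GBDT, XGBoost and LightGBM) and score each one with accuracy and AUC. In the previous task, three simple models (LR, SVM and DecisionTree) were used to predict and evaluate the samples; see https://blog.csdn.net/wxq_1993/article/details/85703936.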

# 1. Import the required modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
import time
import warnings
warnings.filterwarnings('ignore')
# 2. Split the data into X and y and take a quick look at it
data_original=pd.read_csv("data_all.csv")
data_original.head(5)
data_original.describe()
#data_original.info()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day reg_preference_for_trad latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
count 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 ... 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.00000 4754.000000 4754.000000
mean 0.021801 0.901332 1940.197728 14.152318 0.804493 0.365356 17.503155 29.004628 21.748422 2.678797 ... 5.088347 16418.973496 7507.426378 24.041649 51.984013 0.372949 4.273875 3.42196 4.542701 3.025873
std 0.041519 0.144837 3923.971494 693.961441 0.196920 0.170194 4.474686 22.711659 16.472031 0.890198 ... 3.344794 13885.107357 5830.674623 36.500344 53.249364 0.687382 1.333778 1.93213 2.987731 1.895870
min 0.000000 0.000000 0.000000 0.000000 0.120000 0.033000 2.000000 0.000000 4.000000 1.000000 ... 0.000000 0.000000 0.000000 -2.000000 -2.000000 0.000000 1.000000 0.00000 1.000000 0.000000
25% 0.010000 0.880000 0.000000 0.620000 0.670000 0.233000 15.000000 16.000000 12.000000 2.000000 ... 3.000000 7800.000000 4200.000000 6.000000 7.000000 0.000000 4.000000 2.00000 3.000000 2.000000
50% 0.010000 0.960000 500.000000 0.970000 0.860000 0.350000 17.000000 23.000000 17.000000 3.000000 ... 4.000000 14400.000000 6750.000000 16.000000 29.000000 0.000000 4.000000 4.00000 4.000000 3.000000
75% 0.020000 0.990000 2000.000000 1.600000 1.000000 0.479500 20.000000 32.000000 26.750000 3.000000 ... 7.000000 20400.000000 9696.250000 23.000000 86.000000 1.000000 5.000000 5.00000 5.000000 5.000000
max 1.000000 1.000000 68000.000000 47596.740000 1.000000 0.941000 42.000000 285.000000 234.000000 5.000000 ... 20.000000 266400.000000 82800.000000 360.000000 323.000000 4.000000 12.000000 6.00000 12.000000 6.000000

8 rows × 85 columns

y=data_original['status'].copy()
X=data_original.drop(['status'],axis=1).copy()
print("the X shape is:", X.shape)
print("the y shape is:", y.shape)
print("the nums of label 1 in y are",len(y[y==1]))
print("the nums of label 0 in y are",len(y[y==0]))
df_ret=pd.DataFrame(columns=('Model','Accuracy','AUC','Time'))
row=0
the X shape is: (4754, 84)
the y shape is: (4754,)
the nums of label 1 in y are 1193
the nums of label 0 in y are 3561

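There are 4754 samples in total, each with 84 features. 1193 samples have label 1 and 3561 have label 0, so the positive and negative classes are clearly imbalanced (roughly 1:3), which should be taken into account in later processing.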
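One way to account for the imbalance at the splitting step is a stratified split, which keeps the label ratio the same in the training and test sets. The sketch below is not what this post does (the split further down is not stratified); it uses synthetic labels with the same class counts as above, and the names `X_demo` and `y_demo` are placeholders rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the post (3561 zeros, 1193 ones).
y_demo = np.array([0] * 3561 + [1] * 1193)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not the real data

# stratify=y_demo asks the splitter to preserve the label ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate overall:  %.4f" % overall_rate)
print("positive rate in train: %.4f" % train_rate)
print("positive rate in test:  %.4f" % test_rate)
```

With stratification the three printed rates should be almost identical, so the test metrics are not shifted by an unlucky split.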

# 3. Split the dataset 70/30 into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
print('the proportion of label 1 in y_test: %.2f%%'%(len(y_test[y_test==1])/len(y_test)*100))
the proportion of label 1 in y_test: 25.16%
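# 4. Define an evaluation function
def evaluate(y_pre,y):
    # y_pre: predicted labels, y: true labels
    acc=accuracy_score(y,y_pre)
    # Note: the AUC here is computed from hard 0/1 predictions, not from probabilities
    auc=roc_auc_score(y,y_pre)
    return acc,auc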

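Because accuracy_score() and f1_score() were called over and over in the first assignment, in this second assignment the metric calls are wrapped in a single evaluation function for convenience.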
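The evaluation function in this post passes the hard 0/1 predictions from `predict()` to `roc_auc_score`. With hard binary labels the ROC curve has only one operating point, and the resulting value equals the mean of the true positive rate and the true negative rate. AUC is more commonly computed from a continuous score such as the predicted probability of class 1. The sketch below compares the two on synthetic data generated with `make_classification`; it does not use `data_all.csv`, and the numbers it prints say nothing about the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 25% positives); not the post's data_all.csv.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

# AUC from hard 0/1 predictions, as the evaluation function in the post does.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of class 1.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels: %.4f" % auc_from_labels)
print("AUC from scores: %.4f" % auc_from_scores)
```

If the evaluation function were changed along these lines, it would need the output of `predict_proba` as an extra argument, and the AUC column would no longer be comparable with the values reported here.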

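A problem came up here: I copied the default parameters of the four classifiers (RF, GBDT, XGBoost, LightGBM) straight from the official documentation, but running the code raised an error complaining about Chinese characters or spaces, so I had to type the parameters in by hand in the simple form below.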
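Errors like this usually come from full-width punctuation or no-break spaces that are picked up when copying from a web page. As a minimal sketch, assuming that is the cause here, the pasted text can be normalized with the standard library before use. The helper name `sanitize_pasted_code` and the sample string are hypothetical.

```python
import unicodedata

def sanitize_pasted_code(text):
    """Fold full-width and other compatibility characters to plain forms."""
    # NFKC maps full-width punctuation and no-break spaces to their ASCII counterparts.
    text = unicodedata.normalize("NFKC", text)
    # Zero-width characters have no ASCII counterpart, so drop them explicitly.
    for ch in ("\u200b", "\ufeff"):
        text = text.replace(ch, "")
    return text

# Hypothetical pasted line: full-width brackets and comma plus a no-break space.
pasted = "RandomForestClassifier\uff08n_estimators=100\uff0c\u00a0max_depth=None\uff09"
cleaned = sanitize_pasted_code(pasted)

# Show which characters in the original were not plain ASCII.
suspects = [(i, hex(ord(c))) for i, c in enumerate(pasted) if ord(c) > 127]
print(suspects)
print(cleaned)
```

NFKC also rewrites non-ASCII characters inside string literals, so this is only suitable for snippets, such as parameter lists, that are expected to be pure ASCII.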

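# 5. Build the models used for prediction
# Use RF, GBDT, XGBoost and LightGBM in turn; since I am not yet familiar with these models, default values are used throughout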

rf_model=RandomForestClassifier(n_estimators=100,max_depth=None,criterion='gini')
gbdt_model=GradientBoostingClassifier(n_estimators=100,max_depth=3,learning_rate=0.1)
xgb_model=XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3)
lgbm_model=LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=-1)
# 6. Train the models
models=[('RF',rf_model),('gbdt',gbdt_model),('xgb',xgb_model),('lgbm',lgbm_model)]
for name,model in models:
    print(name,'start training.....')
    startTime=time.perf_counter()  # time.clock() was removed in Python 3.8
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    endTime=time.perf_counter()
    print(name,'using time is %.4f'%(endTime-startTime))
    acc,auc=evaluate(y_pred,y_test)
    print(name,'accuracy_score:',round(acc,4),'auc_score: ',round(auc,4))
    df_ret.loc[row]=[name,acc,auc,(endTime-startTime)]
    row+=1
    print('\n')
print(df_ret)
RF start training.....
RF using time is 1.3224
RF accuracy_score: 0.7849 auc_score:  0.6076


gbdt start training.....
gbdt using time is 1.3351
gbdt accuracy_score: 0.78 auc_score:  0.6376


xgb start training.....
xgb using time is 0.7749
xgb accuracy_score: 0.7856 auc_score:  0.6432


lgbm start training.....
lgbm using time is 0.7061
lgbm accuracy_score: 0.7701 auc_score:  0.631


  Model  Accuracy       AUC      Time
0    RF  0.784863  0.607558  1.322362
1  gbdt  0.779958  0.637566  1.335071
2   xgb  0.785564  0.643161  0.774934
3  lgbm  0.770147  0.631012  0.706147
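To put the accuracy column in context, it helps to compare it with a trivial baseline. Since 25.16% of y_test has label 1 (as printed earlier), a classifier that always predicts 0 would already score 74.84% accuracy on this test set. The short calculation below uses only the numbers printed in this post; the variable names are illustrative.

```python
# Share of label 1 in y_test, as printed in the post.
positive_rate_test = 0.2516

# A classifier that always predicts 0 is right on every negative sample.
majority_baseline_acc = 1.0 - positive_rate_test

# Accuracy values reported in the results table.
reported_acc = {"RF": 0.784863, "gbdt": 0.779958, "xgb": 0.785564, "lgbm": 0.770147}

print("majority-class baseline accuracy: %.4f" % majority_baseline_acc)
for name, acc in reported_acc.items():
    print("%-5s gain over baseline: %.4f" % (name, acc - majority_baseline_acc))
```

All four models sit only a few percentage points above this baseline, which is one reason accuracy alone is a weak metric on imbalanced data.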

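The results show that these four ensemble models clearly outperform the three models used in the first task; **XGBoost performs best (accuracy 0.7856, AUC 0.6432) and LightGBM is the fastest (0.7061 s).** Because copying the default parameters raised an error, only three parameters were set for each model during training; this will be improved in later training runs. In addition, XGBoost and GBDT are frequently asked about in interviews and deserve particular attention.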
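The defaults do not have to be copied from the documentation: scikit-learn estimators expose them through `get_params()`. The sketch below lists them for the two scikit-learn models. The scikit-learn wrappers `XGBClassifier` and `LGBMClassifier` should offer the same method, but they are left out here to keep the example dependent on scikit-learn only. The exact values printed depend on the installed library version.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Instantiate each model with no arguments and list the parameters it ends up with.
for cls in (RandomForestClassifier, GradientBoostingClassifier):
    defaults = cls().get_params()
    print(cls.__name__)
    for key in sorted(defaults):
        print("   ", key, "=", defaults[key])

# Keep the dictionaries around for later inspection or tuning.
rf_defaults = RandomForestClassifier().get_params()
gbdt_defaults = GradientBoostingClassifier().get_params()
```

A dictionary obtained this way can be edited and passed back with `set_params(**params)`, which avoids retyping every argument when tuning.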
