
Machine Learning (1): Three Feature Selection Methods in Python

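An introduction to the three feature selection methods:

  1. Filter:

    Select the features that are most strongly correlated with the target variable. Drawback: it ignores the relationships among the features themselves, so redundant features can be selected together.

  2. Wrapper:

    Remove features step by step, guided by the coefficients of a linear model and by the model's AUC. If dropping the features whose coefficients have small absolute values leaves the AUC largely unchanged, or lowers it only slightly, those features can be removed.

  3. Embedded:

    Let a model pick the features, typically a linear model with regularization (L1), and keep the features whose weights are non-zero. (Suited to feature spaces that are very high-dimensional and very sparse, where SVD or PCA is too expensive to compute.)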
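
The filter idea in item 1, together with its drawback, can be sketched by hand. This is only an illustration under stated assumptions: `top_k_by_correlation` is a hypothetical helper (not a scikit-learn function), the iris class labels are treated as a numeric target so that Pearson correlation applies, and k=2 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris


def top_k_by_correlation(X, y, k):
    """Hypothetical helper: indices of the k features with the largest |Pearson r| against y."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    top = np.argsort(scores)[::-1][:k]  # highest scores first
    return sorted(top.tolist()), scores


X, y = load_iris(return_X_y=True)
selected, scores = top_k_by_correlation(X, y, k=2)
X_new = X[:, selected]

# Correlation between the two kept features: the filter never looks at this,
# which is the drawback noted in item 1
redundancy = abs(np.corrcoef(X_new[:, 0], X_new[:, 1])[0, 1])
print(selected, X_new.shape, round(float(redundancy), 2))
```

On iris the two selected features are also strongly correlated with each other, so the second one adds little information that the first does not already carry.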

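The wrapper procedure in item 2 (drop the feature with the smallest absolute coefficient, then check whether AUC holds up) could look roughly like the sketch below. Several things here are assumptions rather than part of the original post: the breast cancer dataset is swapped in because AUC needs a classification target, `make_model`, `cv_auc` and `backward_eliminate` are hypothetical helper names, and the tolerance `tol=0.002` is an arbitrary value.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_model():
    # Standardize first so coefficient magnitudes are comparable across features
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def cv_auc(X, y):
    """Mean 5-fold cross-validated AUC on the given columns."""
    return cross_val_score(make_model(), X, y, cv=5, scoring="roc_auc").mean()


def backward_eliminate(X, y, tol=0.002):
    """Hypothetical helper: drop the smallest-|coef| feature while AUC stays within tol."""
    kept = list(range(X.shape[1]))
    best_auc = cv_auc(X, y)
    while len(kept) > 1:
        coef = np.abs(make_model().fit(X[:, kept], y)[-1].coef_).ravel()
        weakest = int(np.argmin(coef))
        candidate = kept[:weakest] + kept[weakest + 1:]
        auc = cv_auc(X[:, candidate], y)
        if auc < best_auc - tol:
            break  # removing this feature costs too much AUC, so stop here
        kept, best_auc = candidate, max(best_auc, auc)
    return kept


data = load_breast_cancer()
X, y = data.data, data.target
kept = backward_eliminate(X, y)
print(len(kept), "of", X.shape[1], "features kept")
print("AUC, all features: ", round(float(cv_auc(X, y)), 4))
print("AUC, kept features:", round(float(cv_auc(X[:, kept], y)), 4))
```

Comparing against the best AUC seen so far, rather than against the previous step, keeps many small losses from adding up unnoticed.
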
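Python implementation

"""1. Filter"""
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
# Keep the 2 features with the highest chi-squared score against the target
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)
"""Output:
(150, 4)
(150, 2)"""

"""2. Wrapper"""
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# load_boston was removed in scikit-learn 1.2, so this part needs an older release
from sklearn.datasets import load_boston

boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
lr = LinearRegression()
# n_features_to_select=1 eliminates features one at a time until a single one
# remains, which yields a complete ranking (rank 1 = the last feature left)
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)
print("features sorted by their rank:")
print(sorted(zip(rfe.ranking_, names)))
"""Output reported in the original post (run under Python 2, so the exact
formatting of the ranks may differ). RFE ranks features by the order in which
they are eliminated on coefficient magnitude; it does not compute AUC:
features sorted by their rank:
[(1.0, 'NOX'), (2.0, 'RM'), (3.0, 'CHAS'), (4.0, 'PTRATIO'), (5.0, 'DIS'), (6.0, 'LSTAT'), (7.0, 'RAD'), (8.0, 'CRIM'), (9.0, 'INDUS'), (10.0, 'ZN'), (11.0, 'TAX'), (12.0, 'B'), (13.0, 'AGE')]"""

"""3. Embedded (older scikit-learn versions do not have SelectFromModel)"""
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
# The L1 penalty drives some weights to exactly zero
lsvc = LinearSVC(C=0.01, penalty='l1', dual=False).fit(X, y)
# prefit=True reuses the fitted model; features with non-zero weight are kept
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
"""Output:
(150, 4)
(150, 3)
"""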
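
To see which features the embedded approach actually keeps, and that they are the ones with non-zero weight, the fitted model can be inspected as in the sketch below. It reuses the same `LinearSVC` settings as above; `random_state=0` and the per-feature `importance` computed by summing absolute weights over the classes are additions made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0).fit(X, y)
selector = SelectFromModel(lsvc, prefit=True)

# coef_ holds one weight vector per class; summing |w| over the classes
# gives a single importance value per feature
importance = np.abs(lsvc.coef_).sum(axis=0)
mask = selector.get_support()  # boolean mask of the kept features

for name, w, keep in zip(iris.feature_names, importance, mask):
    print(f"{name:20s} sum|w| = {w:.4f}  kept = {keep}")
```

A feature is dropped only when its weight is (numerically) zero for every class, which is why a smaller `C`, meaning stronger L1 regularization, tends to remove more features.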