
Machine Learning - Feature Selection and Dimensionality Reduction


The features set the upper bound on the best achievable result; algorithms and models only bring the result closer to that bound. This is why feature engineering, and the choice of which features to use, matters so much.

Below are some feature selection and dimensionality reduction techniques, collected into one small utility class.

# -*- coding:utf-8 -*-
import scipy as sc
import scipy.stats  # load the stats submodule so sc.stats is available
import libsvm_file_process as data_process  # project-local libsvm reader
import numpy as np
from minepy import MINE
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


class feature_select:
    """
    Feature selection methods
    (see http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection):
        - Pearson correlation
        - Mutual information
        - Univariate tests: chi-squared, F-value, false positive rate
        - Variance filtering
        - Recursive feature elimination (RFE): drops one feature per step,
          based on the coefficient attached to each feature
        - Model-based selection (LR/GBDT, etc.) via SelectFromModel; the model
          must expose a feature_importances_ or coef_ attribute

    Dimensionality reduction:
        - PCA (unsupervised): normally used without labels; in a supervised
          setting it can still reduce the dimension slightly to strip noise
          before applying LDA
        - LDA (supervised): essentially a classifier; the target dimension
          must be smaller than the number of classes
    """

    def __init__(self):
        self.data_path = "/trainData/libsvm2/"
        self.trainData = ["20180101"]
        # mutual information (MIC)
        self.mine = MINE(alpha=0.6, c=15, est="mic_approx")
        # variance filtering - typically used for unsupervised learning
        self.variance_filter = VarianceThreshold(threshold=0.1)
        # univariate selection: chi2 - chi-squared test; f_regression - F-value;
        # SelectFpr - false positive rate; etc. (f_regression is used here)
        self.chi_squared = SelectKBest(f_regression, k=2)
        # recursive feature elimination
        self.estimator = LogisticRegression()  # or SVR(kernel="linear")
        self.selector = RFE(self.estimator, n_features_to_select=5, step=1)
        # PCA dimensionality reduction
        self.pca = PCA(n_components=5)
        # LDA dimensionality reduction
        self.lda = LinearDiscriminantAnalysis(n_components=2)

    def select(self):
        for i in range(len(self.trainData)):
            generator = data_process.get_data_batch(
                self.data_path + self.trainData[i] + "/part-00000", 100000)
            labels, features = next(generator)
            # variance filtering
            filter1 = self.variance_filter.fit_transform(features)
            print(filter1.shape, features.shape)
            print(self.variance_filter.get_support())
            # univariate selection (F-test)
            filter2 = self.chi_squared.fit_transform(features, labels)
            print(filter2.shape)
            print(self.chi_squared.get_support())
            # recursive feature elimination (slow, commented out for now)
            # self.selector.fit(features, labels)
            # print(self.selector.support_)
            # PCA
            transform1 = self.pca.fit_transform(features)
            print("transform1:", transform1)
            # LDA
            self.lda.fit(features, labels)
            transform2 = self.lda.transform(features)
            print("transform2:", transform2)
            # per-column scores, skipping the first 870 columns
            for j in range(int(features.shape[1]) - 870):
                features_j = features[0:, j + 870: j + 871]
                # mutual information
                self.mine.compute_score(features_j.flatten(), labels.flatten())
                print(self.mine.mic())
                # Pearson correlation (pearsonr expects 1-D arrays)
                print(j, sc.stats.pearsonr(features_j.flatten(), labels.flatten()))


if __name__ == "__main__":
    feature_util = feature_select()
    feature_util.select()
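The docstring also suggests using PCA to strip noise before a supervised LDA projection. One way to wire that up is an sklearn Pipeline; again, the synthetic data and the chosen component counts are only for illustration:

# PCA denoising followed by LDA projection (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
# PCA removes low-variance noise directions first; LDA then projects to at
# most n_classes - 1 dimensions (3 classes -> at most 2 dimensions).
pipe = Pipeline([
    ("pca", PCA(n_components=20)),
    ("lda", LinearDiscriminantAnalysis(n_components=2)),
])
X_2d = pipe.fit_transform(X, y)
print(X_2d.shape)  # (1000, 2)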
