
Feature Selection Methods for Big Data Bioinformatics: A Survey from the Search Perspective

# Citation

## BibTeX

@article{WANG201621,
  title = "Feature selection methods for big data bioinformatics: A survey from the search perspective",
  journal = "Methods",
  volume = "111",
  pages = "21 - 31",
  year = "2016",
  note = "Big Data Bioinformatics",
  issn = "1046-2023",
  doi = "https://doi.org/10.1016/j.ymeth.2016.08.014",
  url = "http://www.sciencedirect.com/science/article/pii/S1046202316302742",
  author = "Lipo Wang and Yaoli Wang and Qing Chang",
  keywords = "Biomarkers, Classification, Clustering, Computational biology, Computational intelligence, Data mining, Evolutionary computation, Evolutionary algorithms, Fuzzy logic, Genetic algorithms, Machine learning, Microarray, Neural networks, Particle swarm optimization, Pattern recognition, Random forests, Rough sets, Soft computing, Swarm intelligence, Support vector machines"
}

## Plain text

Lipo Wang, Yaoli Wang, Qing Chang, Feature selection methods for big data bioinformatics: A survey from the search perspective, Methods, Volume 111, 2016, Pages 21-31, ISSN 1046-2023, https://doi.org/10.1016/j.ymeth.2016.08.014. (http://www.sciencedirect.com/science/article/pii/S1046202316302742) Keywords: Biomarkers; Classification; Clustering; Computational biology; Computational intelligence; Data mining; Evolutionary computation; Evolutionary algorithms; Fuzzy logic; Genetic algorithms; Machine learning; Microarray; Neural networks; Particle swarm optimization; Pattern recognition; Random forests; Rough sets; Soft computing; Swarm intelligence; Support vector machines

# Abstract

Big data bioinformatics

Applications of feature selection

Traditional taxonomy:

  1. filter
  2. wrapper
  3. embedded

New taxonomy (feature selection viewed as a combinatorial optimization / search problem):

  1. exhaustive search
  2. heuristic search, with or without data-extracted feature importance ranking
  3. hybrid methods

# 1 Truly optimal feature selection: exhaustive search

Classifiers:

  1. random forests
  2. support vector machines (SVMs)
  3. cluster-oriented ensemble classifiers
  4. random vector functional link (RVFL) networks
  5. radial basis function (RBF) neural networks

Searching for the truly optimal feature subset is computationally expensive: the problem is NP-hard.

It requires exhausting all possible feature combinations.

"Combinatorial explosion": the number of candidate subsets grows exponentially with the number of features.

With more than about 30 original features, exhaustive search becomes practically impossible (2^30 is already over 10^9 subsets).
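
For concreteness, a minimal sketch of exhaustive-search feature selection; `evaluate` is a placeholder for any subset-quality score (e.g. cross-validated accuracy of one of the classifiers above), not a function from the survey:

```python
from itertools import combinations

def exhaustive_search(X, y, evaluate):
    """Enumerate every non-empty feature subset and return the best one.

    `evaluate(X_subset, y)` can be any subset-quality score, e.g. the
    cross-validated accuracy of a classifier. With n features there are
    2^n - 1 subsets, so this is only feasible for small n (roughly < 30).
    """
    n = X.shape[1]
    best_score, best_subset = float("-inf"), None
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            score = evaluate(X[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```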

# 2 Suboptimal feature selection: heuristic search

"Heuristic search": guided by "experience" or "informed choices", in the hope of finding a good suboptimal solution, or even the global optimum.

It outperforms random search.

Essential ingredients of such algorithms:

  1. local improvement
  2. innovation

simulated annealing: accepts worse solutions with a certain probability, which helps the search escape local optima (a sketch follows).
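
A minimal sketch of this acceptance rule applied to a binary feature mask; the neighborhood move, cooling schedule, and parameters are generic assumptions, not the survey's settings:

```python
import math
import random

def anneal_feature_mask(score, n_features, n_iters=1000, t0=1.0, cooling=0.995):
    """Simulated annealing over binary feature masks.

    `score(mask)` returns subset quality (higher is better) and must
    tolerate an all-False mask. Worse moves are accepted with probability
    exp(delta / T), which is what lets the search escape local optima.
    """
    mask = [random.random() < 0.5 for _ in range(n_features)]
    cur = best = score(mask)
    best_mask = mask[:]
    t = t0
    for _ in range(n_iters):
        j = random.randrange(n_features)
        mask[j] = not mask[j]              # flip one bit: a neighboring solution
        new = score(mask)
        delta = new - cur
        if delta >= 0 or random.random() < math.exp(delta / t):
            cur = new                      # accept the move (possibly a worse one)
            if cur > best:
                best, best_mask = cur, mask[:]
        else:
            mask[j] = not mask[j]          # reject: undo the flip
        t *= cooling                       # geometric cooling schedule
    return best_mask, best
```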

Other heuristic algorithms:

  • genetic algorithm (GA)
  • ant colony optimization (ACO)
  • particle swarm optimization (PSO)
  • chaotic simulated annealing
  • tabu search
  • noisy chaotic simulated annealing
  • branch-and-bound

## A Heuristic-search-based feature selection without data-extracted feature importance ranking

  • a binary vector encodes whether each feature is selected (a wrapper-evaluation sketch follows this list)
  • the nearest neighbor classifier; case-based reasoning; a leave-one-out procedure
  • succinct rules
  • silhouette statistics
  • microarray; peak tree
  • input weights of an SVM or a neural network (embedded): a feature importance ranking not derived directly from the data; statistical analysis of the weights
  • K-means + SVM
  • margin influence analysis (MIA) + SVM
  • Mann–Whitney U test: a nonparametric test method with no distribution-related assumptions
  • in a hybrid descriptor space
  • Blocking (modularization)
  • aggregating the outputs of multiple learning algorithms to evaluate gene subsets: clearly improved performance, independent of the classification algorithm used
  • quantitative structure–activity relationships (QSARs): biological activities of chemical compounds + their physicochemical descriptors
  • lexico-semantic event structures; a noun argument structure; corpus; SRL systems
  • nonparallel plane proximal classifiers
  • SVM + $L_p$ regularization: high dimensionality
  • the support feature machine (SFM)
  • fuzzy-rough sets
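
To make the binary-vector encoding concrete, here is a hedged sketch of a wrapper-style subset evaluation combining three items from the list: a binary mask, the nearest neighbor classifier, and a leave-one-out procedure (the Euclidean metric is an assumption):

```python
import numpy as np

def loo_1nn_score(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier restricted
    to the features where `mask` (the binary selection vector) is True."""
    Xs = X[:, np.asarray(mask, dtype=bool)]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)  # Euclidean distances (assumption)
        d[i] = np.inf                           # hold out sample i
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)
```

Any heuristic search (e.g. the annealer above) can use such a score as its objective over masks.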

Feature evaluation criteria:

  1. dependency
  2. relevance
  3. redundancy
  4. significance

the signal-to-noise ratio (SNR)
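
A sketch of SNR-based feature ranking under the common two-class definition SNR_j = |mu1_j - mu0_j| / (sigma1_j + sigma0_j); the survey may use a variant, so treat this exact form as an assumption:

```python
import numpy as np

def snr_ranking(X, y):
    """Rank features by the two-class signal-to-noise ratio,
    SNR_j = |mu1_j - mu0_j| / (sigma1_j + sigma0_j), assuming labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    snr = np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)
    return np.argsort(snr)[::-1]  # feature indices, most informative first
```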

a Laplace naive Bayes model: the Laplace distribution replaces the normal distribution

Array comparative genomic hybridization (aCGH) 陣列比較基因組雜交

V. Metsis, F. Makedon, D. Shen, H. Huang, DNA copy number selection using robust structured sparsity-inducing norms, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 168–181, http://dx.doi.org/10.1109/TCBB.2013.141.

## B Greedy search with data-extracted feature importance ranking

First, the importance of each individual feature is evaluated.

A feature subset that is best for one classifier does not necessarily work well for another.

Importance measures (derived directly from the input data; a greedy-selection sketch follows this list):

  1. t-test
  2. fold-change difference
  3. Z-score
  4. Pearson correlation coefficient
  5. relative entropy
  6. mutual information
  7. separability-correlation measure
  8. feature relevance
  9. label changes produced by each feature
  10. information gain
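
A sketch of the greedy pattern this subsection describes, using the t-test measure from the list above; the "keep a feature only if the wrapper score improves" stopping rule is an illustrative assumption:

```python
import numpy as np

def t_scores(X, y):
    """Per-feature two-sample t-statistic, computed directly from the data
    (population variances, no small-sample correction, labels 0/1)."""
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(0) / len(X0) + X1.var(0) / len(X1)) + 1e-12
    return np.abs(X1.mean(0) - X0.mean(0)) / se

def greedy_select(X, y, evaluate):
    """Visit features in decreasing importance; keep each one only if the
    wrapper score `evaluate` improves (illustrative stopping rule)."""
    chosen, best = [], float("-inf")
    for j in np.argsort(t_scores(X, y))[::-1]:
        score = evaluate(X[:, chosen + [int(j)]], y)
        if score > best:
            chosen.append(int(j))
            best = score
    return chosen, best
```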

Dimensionality reduction methods:

  • class-separability measure
  • Fisher ratio
  • principal components analysis (PCA)
  • t-test

Four feature selection (FS) methods:

  • t-test
  • significance analysis of microarrays (SAM)
  • rank products (RP)
  • random forest (RF)

# 3 Hybrid feature selection techniques

## A Semi-exhaustive search

1. Select a number of important features, using:

  • a feature importance ranking measure
  • the Fisher-Markov selector
  • an equal-width discretization scheme
  • an ensemble of several traditional statistical methods
  • high predictive power

2. Search further over the reduced feature set, using:

  • exhaustive search
  • multi-objective optimization
  • an embedded GA, tabu search (TS), and SVM
  • a graph optimization model

A two-stage sketch of this idea follows.
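
A minimal two-stage sketch, assuming a Fisher-ratio-style filter for stage 1 and plain exhaustive search for stage 2 (m = 10 is an arbitrary illustrative cutoff):

```python
from itertools import combinations
import numpy as np

def semi_exhaustive(X, y, evaluate, m=10):
    """Stage 1: keep the m top-ranked features (Fisher-ratio-like score,
    labels 0/1). Stage 2: exhaustive search over subsets of those m only,
    shrinking the search space from 2^n to 2^m."""
    X0, X1 = X[y == 0], X[y == 1]
    fisher = (X1.mean(0) - X0.mean(0)) ** 2 / (X1.var(0) + X0.var(0) + 1e-12)
    top = np.argsort(fisher)[::-1][:m]
    best_score, best_subset = float("-inf"), None
    for k in range(1, m + 1):
        for subset in combinations(top, k):
            score = evaluate(X[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```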

## B Other hybrid feature selection methods

Feature extraction methods:

  • spectral biclustering
  • sparse component analysis
  • Poisson model
  • scatter matrix
  • singular value decomposition
  • weighted PCA
  • robust principal component analysis
  • linear discriminant analysis
  • Laplacian linear discriminant analysis (LLDA)
  • Laplacian score
  • SVD-entropy
  • nonnegative matrix factorization (NMF)
  • sparse NMF (SNMF)
  • an artificial neural network classification scheme

# 4 Summary and outlook

Big data bioinformatics

## A The small-sample problem

The dimensionality (number of genes) is very high (>20,000), while the sample size is tiny (~50 patients).

overfitting and overoptimism

## B Imbalanced data

The number of samples differs across classes.

up-sampling classes with fewer data, down-sampling classes with more data

making classification errors sensitive to classes (cost-sensitive learning); both remedies are sketched below
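
A sketch of both remedies in plain NumPy; the inverse-frequency weighting is a common convention and the two-class up-sampling is simplified, neither is taken verbatim from the survey:

```python
import numpy as np

def upsample_minority(X, y, seed=0):
    """Resample the smaller class with replacement until class sizes match
    (assumes two classes for simplicity)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def class_weights(y):
    """Inverse-frequency weights so errors on rare classes cost more
    (cost-sensitive learning)."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, len(y) / (len(classes) * counts)))
```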

signal-to-noise correlation coefficient (S2N); Feature Assessment by Sliding Thresholds (FAST)

empirical mutual information — the data sparseness issue

multivariate normal distributions

## C Class-dependent feature selection

A different feature subset is selected for each class.

class-independent FS vs. class-dependent FS

  • class distributions
  • RBF neural classifier: the clustering property
  • GA; SVM; the multi-layer perceptron (MLP) neural network
  • the probability density function (PDF) projection theorem
  • principal component analysis (PCA) from class-specific subspaces

A C-class classification problem can be decomposed into C two-class classifiers, as sketched below.
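
A sketch of class-dependent FS in this one-vs-rest decomposition: each of the C binary subproblems gets its own feature subset, here ranked by a per-class SNR (the ranking measure and subset size k are illustrative assumptions):

```python
import numpy as np

def class_dependent_subsets(X, y, k=20):
    """For each class c, rank features for the binary problem 'c vs. rest'
    and keep the top k: a different feature subset per class."""
    subsets = {}
    for c in np.unique(y):
        yc = (y == c).astype(int)
        X1, X0 = X[yc == 1], X[yc == 0]
        snr = np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)
        subsets[c] = np.argsort(snr)[::-1][:k]
    return subsets  # feed each subset to its own two-class classifier
```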

feature importance measures:

  • RELIEF
  • class separability
  • minimal-redundancy-maximal-relevance (mRMR)

full class relevant (FCR) and partial class relevant (PCR) features

Markov blanket

multiclass ranking statistics vs. class-specific statistics; the Pareto-front alleviates the bias; F-score and KW-score

a binary tree of simpler classification subproblems

feature subsets of every class