
Feature Selection Methods for Big Data Bioinformatics: A Survey from the Search Perspective

# Citation

## BibTeX

@article{WANG201621,
  title = "Feature selection methods for big data bioinformatics: A survey from the search perspective",
  journal = "Methods",
  volume = "111",
  pages = "21 - 31",
  year = "2016",
  note = "Big Data Bioinformatics",
  issn = "1046-2023",
  doi = "https://doi.org/10.1016/j.ymeth.2016.08.014",
  url = "http://www.sciencedirect.com/science/article/pii/S1046202316302742",
  author = "Lipo Wang and Yaoli Wang and Qing Chang",
  keywords = "Biomarkers, Classification, Clustering, Computational biology, Computational intelligence, Data mining, Evolutionary computation, Evolutionary algorithms, Fuzzy logic, Genetic algorithms, Machine learning, Microarray, Neural networks, Particle swarm optimization, Pattern recognition, Random forests, Rough sets, Soft computing, Swarm intelligence, Support vector machines"
}

## Plain text

Lipo Wang, Yaoli Wang, Qing Chang, Feature selection methods for big data bioinformatics: A survey from the search perspective, Methods, Volume 111, 2016, Pages 21-31, ISSN 1046-2023, https://doi.org/10.1016/j.ymeth.2016.08.014. (http://www.sciencedirect.com/science/article/pii/S1046202316302742) Keywords: Biomarkers; Classification; Clustering; Computational biology; Computational intelligence; Data mining; Evolutionary computation; Evolutionary algorithms; Fuzzy logic; Genetic algorithms; Machine learning; Microarray; Neural networks; Particle swarm optimization; Pattern recognition; Random forests; Rough sets; Soft computing; Swarm intelligence; Support vector machines

# Abstract

Big data bioinformatics

Applications of feature selection

Traditional taxonomy:

  1. filter
  2. wrapper
  3. embedded

New taxonomy (feature selection viewed as a combinatorial optimization / search problem):

  1. exhaustive search
  2. heuristic search, with or without data-extracted feature importance ranking
  3. hybrid methods

# 1 Truly optimal feature selection: exhaustive search

Classifiers:

  1. random forests
  2. support vector machines (SVMs)
  3. cluster-oriented ensemble classifiers
  4. random vector functional link (RVFL) networks
  5. radial basis function (RBF) neural networks

Searching for the truly optimal feature subset is computationally expensive: the problem is NP-hard.

It requires exhausting all possible feature combinations.

"Combinatorial explosion": the number of candidate subsets grows exponentially with the number of features.

With more than about 30 original features, exhaustive search becomes practically impossible (2^30 is already over 10^9 subsets).
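
For concreteness, a minimal sketch of exhaustive-search feature selection; `evaluate` is a placeholder for any subset-quality score (e.g. cross-validated accuracy of one of the classifiers above), not a function from the survey:

```python
from itertools import combinations

def exhaustive_search(X, y, evaluate):
    """Enumerate every non-empty feature subset and return the best one.

    `evaluate(X_subset, y)` can be any subset-quality score, e.g. the
    cross-validated accuracy of a classifier. With n features there are
    2^n - 1 subsets, so this is only feasible for small n (roughly < 30).
    """
    n = X.shape[1]
    best_score, best_subset = float("-inf"), None
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            score = evaluate(X[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```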

# 2 Suboptimal feature selection: heuristic search

"Heuristic search": guided by "experience" or "informed choices", in the hope of finding a good suboptimal solution, or even the global optimum.

It outperforms random search.

Essential ingredients of such algorithms:

  1. local improvement
  2. innovation

simulated annealing: accepts worse solutions with a certain probability, which helps the search escape local optima (a sketch follows).
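
A minimal sketch of this acceptance rule applied to a binary feature mask; the neighborhood move, cooling schedule, and parameters are generic assumptions, not the survey's settings:

```python
import math
import random

def anneal_feature_mask(score, n_features, n_iters=1000, t0=1.0, cooling=0.995):
    """Simulated annealing over binary feature masks.

    `score(mask)` returns subset quality (higher is better) and must
    tolerate an all-False mask. Worse moves are accepted with probability
    exp(delta / T), which is what lets the search escape local optima.
    """
    mask = [random.random() < 0.5 for _ in range(n_features)]
    cur = best = score(mask)
    best_mask = mask[:]
    t = t0
    for _ in range(n_iters):
        j = random.randrange(n_features)
        mask[j] = not mask[j]              # flip one bit: a neighboring solution
        new = score(mask)
        delta = new - cur
        if delta >= 0 or random.random() < math.exp(delta / t):
            cur = new                      # accept the move (possibly a worse one)
            if cur > best:
                best, best_mask = cur, mask[:]
        else:
            mask[j] = not mask[j]          # reject: undo the flip
        t *= cooling                       # geometric cooling schedule
    return best_mask, best
```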

Other heuristic algorithms:

  • genetic algorithm (GA)
  • ant colony optimization (ACO)
  • particle swarm optimization (PSO)
  • chaotic simulated annealing
  • tabu search
  • noisy chaotic simulated annealing
  • branch-and-bound

## A Heuristic-search-based feature selection without data-extracted feature importance ranking

  • a binary vector encodes whether each feature is selected (a wrapper-evaluation sketch follows this list)
  • the nearest neighbor classifier; case-based reasoning; a leave-one-out procedure
  • succinct rules
  • silhouette statistics
  • microarray; peak tree
  • input weights of an SVM or a neural network (embedded): a feature importance ranking not derived directly from the data; statistical analysis of the weights
  • K-means + SVM
  • margin influence analysis (MIA) + SVM
  • Mann–Whitney U test: a nonparametric test method with no distribution-related assumptions
  • in a hybrid descriptor space
  • Blocking (modularization)
  • aggregating the outputs of multiple learning algorithms to evaluate gene subsets: clearly improved performance, independent of the classification algorithm used
  • quantitative structure–activity relationships (QSARs): biological activities of chemical compounds + their physicochemical descriptors
  • lexico-semantic event structures; a noun argument structure; corpus; SRL systems
  • nonparallel plane proximal classifiers
  • SVM + $L_p$ regularization: high dimensionality
  • the support feature machine (SFM)
  • fuzzy-rough sets
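
To make the binary-vector encoding concrete, here is a hedged sketch of a wrapper-style subset evaluation combining three items from the list: a binary mask, the nearest neighbor classifier, and a leave-one-out procedure (the Euclidean metric is an assumption):

```python
import numpy as np

def loo_1nn_score(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier restricted
    to the features where `mask` (the binary selection vector) is True."""
    Xs = X[:, np.asarray(mask, dtype=bool)]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)  # Euclidean distances (assumption)
        d[i] = np.inf                           # hold out sample i
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)
```

Any heuristic search (e.g. the annealer above) can use such a score as its objective over masks.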

Feature evaluation criteria:

  1. dependency
  2. relevance
  3. redundancy
  4. significance

the signal-to-noise ratio (SNR)
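
A sketch of SNR-based feature ranking under the common two-class definition SNR_j = |mu1_j - mu0_j| / (sigma1_j + sigma0_j); the survey may use a variant, so treat this exact form as an assumption:

```python
import numpy as np

def snr_ranking(X, y):
    """Rank features by the two-class signal-to-noise ratio,
    SNR_j = |mu1_j - mu0_j| / (sigma1_j + sigma0_j), assuming labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    snr = np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)
    return np.argsort(snr)[::-1]  # feature indices, most informative first
```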

a Laplace naive Bayes model: the Laplace distribution replaces the normal distribution

Array comparative genomic hybridization (aCGH) 陣列比較基因組雜交

V. Metsis, F. Makedon, D. Shen, H. Huang, DNA copy number selection using robust structured sparsity-inducing norms, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 168–181, http://dx.doi.org/10.1109/TCBB.2013.141.

## B Greedy search with data-extracted feature importance ranking

First, the importance of each individual feature is evaluated.

A feature subset that is best for one classifier does not necessarily work well for another.

Importance measures (derived directly from the input data; a greedy-selection sketch follows this list):

  1. t-test
  2. fold-change difference
  3. Z-score
  4. Pearson correlation coefficient
  5. relative entropy
  6. mutual information
  7. separability-correlation measure
  8. feature relevance
  9. label changes produced by each feature
  10. information gain
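
A sketch of the greedy pattern this subsection describes, using the t-test measure from the list above; the "keep a feature only if the wrapper score improves" stopping rule is an illustrative assumption:

```python
import numpy as np

def t_scores(X, y):
    """Per-feature two-sample t-statistic, computed directly from the data
    (population variances, no small-sample correction, labels 0/1)."""
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(0) / len(X0) + X1.var(0) / len(X1)) + 1e-12
    return np.abs(X1.mean(0) - X0.mean(0)) / se

def greedy_select(X, y, evaluate):
    """Visit features in decreasing importance; keep each one only if the
    wrapper score `evaluate` improves (illustrative stopping rule)."""
    chosen, best = [], float("-inf")
    for j in np.argsort(t_scores(X, y))[::-1]:
        score = evaluate(X[:, chosen + [int(j)]], y)
        if score > best:
            chosen.append(int(j))
            best = score
    return chosen, best
```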

Dimensionality reduction methods:

  • class-separability measure
  • Fisher ratio
  • principal components analysis (PCA)
  • t-test

Four feature selection (FS) methods:

  • t-test
  • significance analysis of microarrays (SAM)
  • rank products (RP)
  • random forest (RF)

# 3 Hybrid feature selection techniques

## A Semi-exhaustive search

1. Select a number of important features, using:

  • a feature importance ranking measure
  • the Fisher-Markov selector
  • an equal-width discretization scheme
  • an ensemble of several traditional statistical methods
  • high predictive power

2. Search further over the reduced feature set, using:

  • exhaustive search
  • multi-objective optimization
  • an embedded GA, tabu search (TS), and SVM
  • a graph optimization model

A two-stage sketch of this idea follows.
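
A minimal two-stage sketch, assuming a Fisher-ratio-style filter for stage 1 and plain exhaustive search for stage 2 (m = 10 is an arbitrary illustrative cutoff):

```python
from itertools import combinations
import numpy as np

def semi_exhaustive(X, y, evaluate, m=10):
    """Stage 1: keep the m top-ranked features (Fisher-ratio-like score,
    labels 0/1). Stage 2: exhaustive search over subsets of those m only,
    shrinking the search space from 2^n to 2^m."""
    X0, X1 = X[y == 0], X[y == 1]
    fisher = (X1.mean(0) - X0.mean(0)) ** 2 / (X1.var(0) + X0.var(0) + 1e-12)
    top = np.argsort(fisher)[::-1][:m]
    best_score, best_subset = float("-inf"), None
    for k in range(1, m + 1):
        for subset in combinations(top, k):
            score = evaluate(X[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```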

## B Other hybrid feature selection methods

Feature extraction methods:

  • spectral biclustering
  • sparse component analysis
  • Poisson model
  • scatter matrix
  • singular value decomposition
  • weighted PCA
  • robust principal component analysis
  • linear discriminant analysis
  • Laplacian linear discriminant analysis (LLDA)
  • Laplacian score
  • SVD-entropy
  • nonnegative matrix factorization (NMF)
  • sparse NMF (SNMF)
  • an artificial neural network classification scheme

# 4 Summary and outlook

Big data bioinformatics

## A The small-sample problem

The dimensionality (number of genes) is very high (>20,000), while the sample size is tiny (~50 patients).

overfitting and overoptimism

## B Imbalanced data

The number of samples differs across classes.

up-sampling classes with fewer data, down-sampling classes with more data

making classification errors sensitive to classes (cost-sensitive learning); both remedies are sketched below
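
A sketch of both remedies in plain NumPy; the inverse-frequency weighting is a common convention and the two-class up-sampling is simplified, neither is taken verbatim from the survey:

```python
import numpy as np

def upsample_minority(X, y, seed=0):
    """Resample the smaller class with replacement until class sizes match
    (assumes two classes for simplicity)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def class_weights(y):
    """Inverse-frequency weights so errors on rare classes cost more
    (cost-sensitive learning)."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, len(y) / (len(classes) * counts)))
```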

signal-to-noise correlation coefficient (S2N); Feature Assessment by Sliding Thresholds (FAST)

empirical mutual information — the data sparseness issue

multivariate normal distributions

## C Class-dependent feature selection

A different feature subset is selected for each class.

class-independent FS vs. class-dependent FS

  • class distributions
  • RBF neural classifier: the clustering property
  • GA; SVM; the multi-layer perceptron (MLP) neural network
  • the probability density function (PDF) projection theorem
  • principal component analysis (PCA) from class-specific subspaces

A C-class classification problem can be decomposed into C two-class classifiers, as sketched below.
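
A sketch of class-dependent FS in this one-vs-rest decomposition: each of the C binary subproblems gets its own feature subset, here ranked by a per-class SNR (the ranking measure and subset size k are illustrative assumptions):

```python
import numpy as np

def class_dependent_subsets(X, y, k=20):
    """For each class c, rank features for the binary problem 'c vs. rest'
    and keep the top k: a different feature subset per class."""
    subsets = {}
    for c in np.unique(y):
        yc = (y == c).astype(int)
        X1, X0 = X[yc == 1], X[yc == 0]
        snr = np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)
        subsets[c] = np.argsort(snr)[::-1][:k]
    return subsets  # feed each subset to its own two-class classifier
```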

feature importance measures:

  • RELIEF
  • class separability
  • minimal-redundancy-maximal-relevance (mRMR)

full class relevant (FCR) and partial class relevant (PCR) features

Markov blanket

multiclass ranking statistics vs. class-specific statistics; the Pareto-front alleviates the bias; F-score and KW-score

a binary tree of simpler classification subproblems

feature subsets of every class