
One-Hot Encoding and Data Normalization

Background

In many machine learning tasks, features are not always continuous values; they may be categorical.

For example, consider the following three features:

["male", "female"]

["from Europe", "from US", "from Asia"]

["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

It is much more efficient to represent these features as numbers. For example:

["male", "from US", "uses Internet Explorer"] 表示為[0, 1, 3]

["female", "from Asia", "uses Chrome"]表示為[1, 2, 1]

However, even after converting to numbers, this data cannot be fed directly into a classifier. Classifiers usually assume their inputs are continuous and ordered, whereas the numbers above carry no order; they are assigned arbitrarily.

One-Hot Encoding

One possible way to solve this problem is one-hot encoding.

One-hot encoding, also known as one-of-N encoding, uses an N-bit state register to encode N states: each state gets its own independent register bit, and at any time exactly one bit is active.

For example:

Natural binary codes: 000, 001, 010, 011, 100, 101

One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000

Put another way: a feature with m possible values becomes m binary features after one-hot encoding. These features are mutually exclusive, and only one is active at a time, so the data becomes sparse.

The main benefits are:

  1. It solves the problem that classifiers do not handle categorical attributes well.

  2. To some extent, it also expands the feature set.

Example

A simple example in Python with Scikit-learn:

from sklearn import preprocessing

# Fit the encoder on 4 samples with 3 integer-coded categorical features
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Encode one new sample and convert the sparse result to a dense array
enc.transform([[0, 1, 3]]).toarray()

Output:

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
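The nine output columns come from the three input columns taking 2, 3, and 4 distinct values in the fitted data (2 + 3 + 4 = 9). A minimal check of this, assuming a scikit-learn version (0.20+) where the fitted encoder exposes categories_:

from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One binary output column per observed category in each input column:
# column 0 has values {0, 1}, column 1 has {0, 1, 2}, column 2 has {0, 1, 2, 3}
print(enc.categories_)
print(enc.transform([[0, 1, 3]]).toarray())   # 2 + 3 + 4 = 9 columns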

How should normalization be done when discrete and continuous features coexist?
Summarized from this post:
https://www.quora.com/What-are-good-ways-to-handle-discrete-and-continuous-inputs-together
In summary:
1. Every raw feature must be normalized individually. For example, suppose feature A ranges over [-1000, 1000] and feature B over [-1, 1].
In logistic regression, w1*x1 + w2*x2, the values of x1 are so large that x2 barely contributes.
So feature normalization is necessary, and each feature must be normalized separately.
2. Common normalizations for continuous features (a minimal NumPy sketch follows this list):
   2.1 Rescale bounded continuous features: rescale all bounded continuous inputs to [-1, 1] via x = (2x - max - min)/(max - min), i.e. a linear rescaling to [-1, 1].
   2.2 Standardize all continuous features: for every continuous feature, compute its mean (u) and standard deviation (s) and apply x = (x - u)/s, i.e. rescale to zero mean and unit variance.
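A minimal NumPy sketch of the two normalizations above (the sample values are made up for illustration):

import numpy as np

# Two continuous features with very different ranges (illustrative values only)
X = np.array([[-1000.0,  0.5],
              [  200.0, -0.2],
              [ 1000.0,  1.0]])

# 2.1 Rescale each bounded feature to [-1, 1]: x' = (2x - max - min) / (max - min)
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_rescaled = (2 * X - col_max - col_min) / (col_max - col_min)

# 2.2 Standardize each feature to zero mean, unit variance: x' = (x - u) / s
u, s = X.mean(axis=0), X.std(axis=0)
X_standardized = (X - u) / s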
3. Handling discrete features:

a) Binarize categorical/discrete features: For all categorical features, represent them as multiple boolean features. For example, instead of having one feature called marriage_status, have 3 boolean features - married_status_single, married_status_married, married_status_divorced - and appropriately set these features to 1 or -1. As you can see, for every categorical feature, you are adding k binary features, where k is the number of values that the categorical feature takes. In other words, discrete features are essentially one-hot encoded: a feature with k possible values is represented with k dimensions.

There are principled reasons for using one-hot encoding on discrete features; it is not an arbitrary choice. They are laid out in the following points:
1. Why do we binarize categorical features?
We binarize the categorical inputs so that they can be thought of as vectors in Euclidean space (we call this embedding the vector in the Euclidean space). In other words, one-hot encoding maps the values of a discrete feature into Euclidean space: each value of the feature corresponds to a point in that space.
 
2. Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification/regression/clustering etc. require computing distances between features or similarities between features, and many definitions of distance and similarity are defined over features in Euclidean space. So we would like our features to lie in the Euclidean space as well. Discrete features are mapped into Euclidean space via one-hot encoding because distance and similarity computations are central to regression, classification, and clustering, and the commonly used distance and similarity measures (cosine similarity included) are defined in Euclidean space.


3. Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1, 2, 3.
Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the Euclidean distance between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says that the distance between job type 1 and job type 2 is smaller than between job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we cannot properly define a distance between the different values that the categorical feature takes. In such cases, isn't it fair to assume that all categorical values are equally far away from each other?
Now, let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
Using one-hot encoding for discrete features does make distance computations more reasonable. For instance, take a discrete feature representing job type with three possible values. Without one-hot encoding the representations are x_1 = (1), x_2 = (2), x_3 = (3), and the distances are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that really mean jobs x_1 and x_3 are less similar than jobs x_1 and x_2? Clearly the distances computed from this representation are not meaningful. With one-hot encoding we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2); every pair of jobs is equally far apart, which is far more reasonable.
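A small NumPy sketch of the distance comparison described above:

import numpy as np

def euclid(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# Integer-coded job types: job 1 looks artificially "closer" to job 2 than to job 3
x1, x2, x3 = [1], [2], [3]
print(euclid(x1, x2), euclid(x2, x3), euclid(x1, x3))   # 1.0 1.0 2.0

# One-hot coded job types: every pair is equally far apart (sqrt(2) ~ 1.414)
x1, x2, x3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(euclid(x1, x2), euclid(x2, x3), euclid(x1, x3))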
4. About the original question
Note that our reason for binarizing the categorical features is independent of the number of values the categorical feature takes, so yes, even if the categorical feature takes 1000 values, we would still prefer to do binarization.
In short, one-hot encoding discrete features is done to make distance computations more reasonable, regardless of how many values the feature takes.
5. Are there cases when we can avoid doing binarization?
Yes. As we figured out earlier, the reason we binarize is that we want some meaningful distance relationship between the different values. As long as there is some meaningful distance relationship, we can avoid binarizing the categorical feature. For example, if you are building a classifier to classify a webpage as an important entity page (a page important to a particular entity) or not, and you have the rank of the webpage in the search results for that entity as a feature, then 1) the rank feature is categorical, and 2) rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature defines a meaningful distance relationship; in this case we do not have to binarize it.

More generally, if you can cluster the categorical values into disjoint subsets such that the subsets have a meaningful distance relationship amongst them, then you don't have to binarize fully; you can split the feature only over those clusters. For example, if a categorical feature has 1000 values but you can split them into 2 groups of, say, 400 and 600, and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one per cluster, and that should be fine.
In short, one-hot encoding a discrete feature exists to make distance computations reasonable; if sensible distances can already be defined without it, one-hot encoding is unnecessary. For example, if a discrete feature has 1000 values that split into two groups of 400 and 600, with a sensible distance defined between the groups and within each group, then one-hot encoding is not needed.
 
After one-hot encoding, each dimension of the encoded discrete feature can be treated as a continuous feature, and can therefore be normalized with the same methods used for continuous features, e.g. rescaled to [-1, 1] or to zero mean and unit variance.
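A minimal sketch of that idea, assuming scikit-learn (whether normalizing the one-hot columns actually helps depends on the downstream model):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_cat = np.array([[0], [1], [2], [1]])             # one categorical feature, three values

# One-hot encode, then treat each resulting 0/1 column as continuous and standardize it
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()
X_scaled = StandardScaler().fit_transform(X_onehot)
print(X_scaled)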
 
Some situations do not require feature normalization:
     It depends on your ML algorithm. Some methods require almost no effort to normalize features or to handle continuous and discrete features together, such as tree-based methods: C4.5, CART, Random Forest, bagging, or boosting. But most parametric models (generalized linear models, neural networks, SVMs, etc.) and methods based on distance metrics (KNN, kernels, etc.) require careful work to achieve good results. Standard approaches include binarizing all categorical features, standardizing all continuous features to zero mean and unit variance, and so on.
     In short, tree-based methods such as random forests, bagging, and boosting do not need feature normalization, whereas parametric models and distance-based models do.
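A quick sketch of why tree-based models can skip normalization: an axis-aligned split only compares a feature against a threshold, so rescaling a feature rescales the learned thresholds without changing the predictions. The data below is synthetic, for illustration only:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
scale = np.array([1000.0, 0.001, 1.0])                 # wildly different feature scales

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * scale, y)

# The scaled tree learns rescaled thresholds but makes the same predictions
X_test = rng.randn(50, 3)
print(np.array_equal(tree_raw.predict(X_test), tree_scaled.predict(X_test * scale)))  # expected: True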

Why one-hot encoding solves the discrete-value problem of categorical data
First, one-hot encoding uses an N-bit state register to encode N states.
E.g., the categories high / medium / low are not separable as a single value; encoded with three bits, they become separable and behave as mutually independent events.
This is similar to an SVM: features that are originally linearly inseparable become separable after being projected into a higher-dimensional space.
GBDT does not handle high-dimensional sparse matrices well; even on low-dimensional sparse matrices it is not necessarily better than an SVM.
Tree models do not really need one-hot encoding:
For a decision tree, one-hot encoding essentially just increases the depth of the tree.
A tree model dynamically generates something like a One-Hot + Feature Crossing mechanism as it grows:
1. One or more features are ultimately encoded by the leaf node a sample falls into; the one-hot values can be viewed as independent events.
2. A decision tree has no notion of feature magnitude, only of which part of a feature's distribution a value falls into.
One-hot encoding can help with linear separability, but for tree models it is not better than label encoding.
A drawback after one-hot encoding and dimensionality reduction:
features that could be crossed before the reduction may no longer be crossable afterwards.
How a tree model trains:
the number of nodes on the path from the root to a leaf corresponds to the number of feature crossings, so a tree model does the crossing by itself.
E.g., "Is it long?" { no → ("round and yellow"? yes → pomelo, no → apple), yes → banana }; here "round" crossed with "yellow" comes from shape (round, long) and color (yellow, red), which as one-hot features give a sample of degree 4.
Using the tree model's leaf nodes as the crossed feature set avoids unnecessary feature-crossing operations, or at least reduces the dimensionality and the candidate set of crossing degrees.
E.g., a degree-2 cross gives a feature vector of size 8, while a tree gives 3 leaf nodes.
A tree model consumes less computation and fewer resources than One-Hot + a high-degree Cartesian product + lasso.
This is why a linear model can be stacked after a tree model:
an n*m input matrix → after training, the decision tree tells us which leaf node each sample falls in → output the leaf index → an n*1 matrix → one-hot encode it → an n*o matrix (o is the number of leaf nodes) → train a linear model on it.
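A minimal sketch of that pipeline with scikit-learn, using a single decision tree for clarity (the same idea extends to GBDT by applying each tree in the ensemble); the data is synthetic:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(500, 4)                              # the n*m input matrix (synthetic)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

leaf_index = tree.apply(X).reshape(-1, 1)          # n*1 matrix of leaf indices
enc = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = enc.fit_transform(leaf_index)        # n*o matrix, o = number of leaves

lr = LogisticRegression().fit(leaf_onehot, y)      # linear model stacked on the leaves
print(lr.score(leaf_onehot, y))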
Typical use: GBDT + LR.
Advantage: it saves the time and space needed for explicit feature crossing.
If a model is trained only on one-hot features, the features are treated as independent of one another.
A way to understand existing models: G(l(tensor)), where
l(·) is the model used at each node, and
G(·) is the topology that connects the nodes.
Neural network: l(·) is a logistic regression unit,
and G(·) is a fully connected topology.
Decision tree: l(·) is LR,
and G(·) is a tree-shaped topology.
Possible innovations: take l(·) to be NB, SVM, a single-layer NN, etc.,
and explore different information-passing schemes for G(·).