Machine Learning with sklearn (20): Feature Engineering (11) Feature Encoding (5) Categorical Feature Encoding (3) One-Hot Encoding with OneHotEncoder

阿新 • Published: 2021-06-19

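Another way to convert categorical (nominal) features into a form that scikit-learn estimators can use is one-of-K encoding, also known as one-hot or dummy encoding. This encoding is implemented in the OneHotEncoder class, which transforms each categorical feature with n_categories possible values into a binary feature vector of length n_categories in which exactly one entry is 1 and all the others are 0.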

Continuing the earlier example:

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
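
To make the mapping concrete, the following is a minimal hand-written sketch of the same idea using only NumPy. The helper name `one_hot_by_hand` is hypothetical and is not part of scikit-learn; it assumes every value in the rows to encode was also present in the training rows.

```python
import numpy as np

def one_hot_by_hand(train_rows, rows):
    """Illustrative one-hot encoding; not the scikit-learn implementation."""
    train = np.array(train_rows, dtype=object)
    data = np.array(rows, dtype=object)
    blocks = []
    for j in range(train.shape[1]):
        # Sorted unique values of column j play the role of categories_[j].
        cats = sorted(set(train[:, j]))
        block = np.zeros((data.shape[0], len(cats)))
        for i, value in enumerate(data[:, j]):
            # list.index raises ValueError for a value never seen in training.
            block[i, cats.index(value)] = 1.0
        blocks.append(block)
    # One block of indicator columns per feature, concatenated left to right.
    return np.hstack(blocks)

X_train = [['male', 'from US', 'uses Safari'],
           ['female', 'from Europe', 'uses Firefox']]
X_new = [['female', 'from US', 'uses Safari'],
         ['male', 'from Europe', 'uses Safari']]
encoded = one_hot_by_hand(X_train, X_new)
print(encoded)
```

Each feature contributes one block of columns, so the output width is the sum of the number of categories per feature (here 2 + 2 + 2 = 6).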

By default, the values each feature can take are inferred automatically from the dataset. They can be found in the categories_ attribute:

>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]

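The categories can also be specified explicitly with the categories parameter. Our dataset has two genders, four possible continents, and four web browsers: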

>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that the training data below lacks some of the categories
>>> # declared for the 2nd and 3rd features
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categories=[...])
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
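
As a rough check of how the explicit categories determine the column layout, the sketch below recomputes the output width and the positions of the hot columns. The variable names `widths` and `hot_columns` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

enc = OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

# Width of each per-feature block; their sum is the total number of columns.
widths = [len(cats) for cats in enc.categories_]

encoded = enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# Indices of the columns set to 1 in the single encoded row.
hot_columns = np.flatnonzero(encoded[0])

print(widths, encoded.shape, hot_columns)
```

Categories that never occur in the training data ('from Asia', 'uses Chrome') still get their own columns because they were declared up front.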

If the training data may be missing some categories, it is often better to specify handle_unknown='ignore' than to set the categories manually as above. When handle_unknown='ignore' is specified and an unknown category is encountered during transform, no error is raised, but the one-hot encoded columns for that feature are all zeros (handle_unknown='ignore' is only supported for one-hot encoding):

>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
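
For contrast, here is a small sketch of what happens with the default handle_unknown='error', next to the 'ignore' behavior shown above. The names `strict_enc`, `lenient_enc`, and `raised` are illustrative only.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
unseen = [['female', 'from Asia', 'uses Chrome']]

# Default behavior: an unknown category makes transform raise ValueError.
strict_enc = OneHotEncoder().fit(X)
try:
    strict_enc.transform(unseen)
    raised = False
except ValueError as exc:
    raised = True
    print("transform failed:", exc)

# With 'ignore', unknown categories become all-zero blocks instead.
lenient_enc = OneHotEncoder(handle_unknown='ignore').fit(X)
row = lenient_enc.transform(unseen).toarray()

# An all-zero block cannot be mapped back, so it is reported as None.
recovered = lenient_enc.inverse_transform(row)
print(row, recovered)
```

Which setting is preferable depends on whether silently encoding an unseen value as "none of the known categories" is acceptable for the downstream model.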

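The drop parameter can also be used to encode each column into n_categories-1 columns instead of n_categories columns. It lets the user specify, for each feature, which category to drop. This helps avoid collinearity in the input matrix for some estimators. For example, it is useful with unregularized regression (LinearRegression), where collinearity makes the covariance matrix non-invertible. In the scikit-learn version described here, handle_unknown must be set to 'error' whenever this parameter is not None: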

>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])
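
The collinearity argument can be illustrated numerically. The sketch below uses a made-up single-feature dataset and an explicit intercept column (both are assumptions for illustration) and compares the rank of the design matrix with and without drop='first'.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with three categories.
X = [['red'], ['green'], ['blue'], ['green']]
intercept = np.ones((len(X), 1))

# Full one-hot encoding: the three indicator columns sum to the intercept.
full = OneHotEncoder().fit_transform(X).toarray()
design_full = np.hstack([intercept, full])

# Dropping the first category removes that exact linear dependence.
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()
design_reduced = np.hstack([intercept, reduced])

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print(design_full.shape, rank_full)        # 4 columns, rank 3
print(design_reduced.shape, rank_reduced)  # 3 columns, rank 3
```

With the full encoding the design matrix has more columns than its rank, which is the situation an unregularized least-squares solver cannot resolve uniquely.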

Categorical features are sometimes represented as dicts rather than scalars; for that case, see "Loading features from dicts".
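
As a brief sketch of the dict-based case, scikit-learn's DictVectorizer one-hot encodes string-valued entries and passes numeric entries through. The records below are invented sample data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one string-valued field and one numeric field.
records = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
    {'city': 'San Francisco', 'temperature': 18.0},
]

vec = DictVectorizer(sparse=False)  # return a dense array for easy printing
encoded = vec.fit_transform(records)

# String values become "key=value" indicator columns; numbers are kept as-is.
print(vec.feature_names_)
print(encoded)
```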

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
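
A short sketch of the LabelBinarizer route for target labels follows; the label values are invented for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical target labels (y), not input features.
y = ['cat', 'dog', 'bird', 'dog']

lb = LabelBinarizer()
Y = lb.fit_transform(y)  # one indicator column per class, in sorted order

print(lb.classes_)
print(Y)

# The binarized labels can be mapped back to the original class names.
restored = lb.inverse_transform(Y)
print(restored)
```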

Read more in the User Guide.

Parameters

categories : ‘auto’ or a list of array-like, default=’auto’

Categories (unique values) per feature:

‘auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories can be found in the categories_ attribute.

New in version 0.20.

drop : {‘first’, ‘if_binary’} or an array-like of shape (n_features,), default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
array : drop[i] is the category in feature X[:, i] that should be dropped.

New in version 0.21: The parameter drop was added in 0.21.

Changed in version 0.23: The option drop='if_binary' was added in 0.23.

sparse : bool, default=True

Return a sparse matrix if True; otherwise return a dense array.

dtype : number type, default=float

Desired dtype of output.

handle_unknown : {‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
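
The effect of the sparse and dtype parameters can be checked with a small sketch. It deliberately leaves the sparse flag at its default and densifies with .toarray(), since, to my knowledge, later scikit-learn releases renamed that flag to sparse_output; treat that version detail as an assumption to verify against your installed release.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# Request a compact integer dtype instead of the default float64.
enc = OneHotEncoder(dtype=np.int8)
out = enc.fit_transform(X)

# With default settings the result is a SciPy sparse matrix.
is_sparse = sparse.issparse(out)

# Densify explicitly when a plain NumPy array is needed.
dense = out.toarray()
print(is_sparse, dense.dtype)
print(dense)
```

Keeping the output sparse is usually the more memory-friendly choice when features have many categories, because each row contains only one nonzero entry per feature.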

Attributes

categories_ : list of arrays

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

drop_idx_ : array of shape (n_features,)

drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.
drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.
drop_idx_ = None if all the transformed features will be retained.

Changed in version 0.23: Added the possibility to contain None values.
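
The three cases described for drop_idx_ can be inspected directly. This is a minimal sketch; the encoder variable names are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# drop='first': index 0 of categories_[i] is dropped for every feature.
first_enc = OneHotEncoder(drop='first').fit(X)

# drop='if_binary': only the two-category feature loses a column;
# the three-category feature gets None.
binary_enc = OneHotEncoder(drop='if_binary').fit(X)

# drop=None (default): nothing is dropped, so drop_idx_ itself is None.
plain_enc = OneHotEncoder().fit(X)

print(list(first_enc.drop_idx_))
print(list(binary_enc.drop_idx_))
print(plain_enc.drop_idx_)
```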

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
  dtype=object)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column only for features that have exactly two categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])

Methods

`fit`(X[,y])	Fit OneHotEncoder to X.
`fit_transform`(X[,y])	Fit OneHotEncoder to X, then transform X.
`get_feature_names`([input_features])	Return feature names for output features.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(X)	Convert the data back to the original representation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform X using one-hot encoding.
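
A brief sketch of how several of these methods fit together follows. It skips get_feature_names on purpose, because the name of that method may differ between scikit-learn releases (an assumption worth checking for your installed version).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# fit_transform is a shortcut for fit followed by transform on the same data.
enc = OneHotEncoder(handle_unknown='ignore')
one_step = enc.fit_transform(X).toarray()
two_step = OneHotEncoder(handle_unknown='ignore').fit(X).transform(X).toarray()
same = np.array_equal(one_step, two_step)

# inverse_transform maps the encoded rows back to the original values.
restored = enc.inverse_transform(one_step)

# get_params / set_params read and update constructor arguments.
before = enc.get_params()['handle_unknown']
enc.set_params(handle_unknown='error')
after = enc.get_params()['handle_unknown']

print(same, restored.tolist(), before, after)
```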
