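Machine Learning with sklearn (9): Feature Engineering (2), Feature Discretization (2), Feature Binarization

Feature binarization is the process of thresholding numerical features to obtain boolean values. This is useful for downstream probabilistic estimators that assume the input data follows a multivariate Bernoulli distribution; sklearn.neural_network.BernoulliRBM is one example.

Binary feature values are also common in the text-processing community (probably because they simplify the probabilistic reasoning), even though normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.

As with Normalizer, the utility class Binarizer is meant to be used in the early stages of a sklearn.pipeline.Pipeline. The fit method does nothing, because each sample is treated independently of the others: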

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer()
>>> binarizer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

The threshold of the binarizer can also be adjusted:

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])

As with the StandardScaler and Normalizer classes, the preprocessing module provides a companion function, binarize, for use when the transformer API is not needed.
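A minimal sketch of the function form, reusing the same matrix as the Binarizer examples above. It assumes only that scikit-learn and NumPy are installed; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import binarize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Stateless function call: no estimator object and no fit step.
X_default = binarize(X)                # threshold defaults to 0.0
X_custom = binarize(X, threshold=1.1)  # only values strictly above 1.1 become 1

print(X_default)
print(X_custom)
```

Because binarize copies its input by default, X itself is left unchanged here.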

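Sparse input

Both binarize and Binarizer accept dense array-like data as well as sparse matrices from scipy.sparse as input.

Sparse input is converted to the Compressed Sparse Row representation (see scipy.sparse.csr_matrix). To avoid unnecessary memory copies, choose the CSR representation upstream.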
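As a rough illustration of the sparse path, the sketch below builds a CSR matrix upstream and passes it to Binarizer. The toy data and the threshold of 1.0 are assumptions chosen for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import Binarizer

X_dense = np.array([[1., 0., 2.],
                    [2., 0., 0.],
                    [0., 1., 0.]])

# Build the CSR representation upstream, as recommended above.
X_csr = sparse.csr_matrix(X_dense)

# fit is a no-op for Binarizer; it is called only to follow the usual API.
binarizer = Binarizer(threshold=1.0).fit(X_csr)
X_bin = binarizer.transform(X_csr)

print(sparse.issparse(X_bin))  # the result stays sparse
print(X_bin.toarray())         # densified only for display
```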

class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True)

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.

It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).
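One way to picture this use is a small pipeline that binarizes count features before a Bernoulli naive Bayes classifier. This is only a sketch: the term-count matrix and labels are hypothetical toy data, and BernoulliNB is one possible choice of downstream estimator.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Hypothetical term-count matrix (rows: documents, columns: terms).
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 2, 0]])
y = np.array([0, 0, 1, 1])  # hypothetical class labels

model = Pipeline([
    ("binarize", Binarizer(threshold=0.0)),  # count > 0 becomes "term present"
    ("nb", BernoulliNB()),
])
model.fit(X_counts, y)

# Inspect what the classifier actually sees: presence/absence indicators.
X_presence = model.named_steps["binarize"].transform(X_counts)
pred = model.predict(X_counts)

print(X_presence)
print(pred)
```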

Read more in the User Guide.

Parameters
threshold : float, default=0.0

Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
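The sparse restriction can be checked with a short sketch. With a negative threshold, the implicit zeros of a sparse matrix would have to map to 1, which would densify the result, so scikit-learn rejects the call. The exact error message may vary between versions, so the example only captures it rather than matching it.

```python
from scipy import sparse
from sklearn.preprocessing import binarize

X_csr = sparse.csr_matrix([[1., 0., -2.],
                           [0., 3., 0.]])

try:
    binarize(X_csr, threshold=-1.0)
    error_message = None
except ValueError as exc:
    # Negative thresholds are rejected for sparse input.
    error_message = str(exc)

print(error_message)
```

If a negative threshold is really needed, one option is to convert the data to a dense array first, at the cost of extra memory.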

copy : bool, default=True

Set to False to perform in-place binarization and avoid a copy (if the input is already a NumPy array or a scipy.sparse CSR matrix).
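A small sketch of the in-place behaviour, assuming the input is already a floating-point NumPy array so that no conversion copy is needed:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# copy=False asks the transformer to overwrite X instead of allocating a new array.
X_out = Binarizer(copy=False).fit(X).transform(X)

print(X)  # the original array now holds the binarized values
```

In-place mode saves memory on large arrays, but the original values are lost, so it only suits cases where they are no longer needed.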

Examples

>>> from sklearn.preprocessing import Binarizer
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])