Machine Learning with sklearn (11): Data Processing (6): Non-linear Transformations

阿新 • Published: 2021-06-19

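Two types of transformations are available: quantile transforms and power transforms. Both are based on monotonic transformations of the features, so they preserve the rank of the values within each feature.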

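By performing a rank transformation, a quantile transform smooths out unusual distributions and is less affected by outliers than scaling methods. It does, however, distort correlations and distances within and across features.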
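As a rough illustration of both points, the sketch below compares QuantileTransformer with MinMaxScaler on a small made-up array that contains one extreme outlier. The data and the choice of n_quantiles=10 (equal to the number of samples) are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

# Nine evenly spaced values plus one extreme outlier (made-up data).
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(X).ravel()
quantile = QuantileTransformer(n_quantiles=10).fit_transform(X).ravel()

# Both transforms are monotonic, so the ordering of the samples is unchanged.
same_order = np.array_equal(np.argsort(minmax), np.argsort(quantile))
print("same ordering:", same_order)

# Min-max scaling is dominated by the outlier; the quantile transform is not.
print("min-max :", minmax.round(3))
print("quantile:", quantile.round(3))
```

Min-max scaling squeezes the nine ordinary values into a narrow band near 0, whereas the quantile transform spreads them roughly evenly over [0, 1]. The price is that the original gaps between values are no longer visible in the output, which is the distortion of distances described above.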

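Power transforms, in contrast, are a family of parametric transformations that aim to map data from any distribution to one that is as close to Gaussian as possible.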

1 Mapping to a Uniform Distribution

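The QuantileTransformer class and the quantile_transform function provide a non-parametric transformation, based on the quantile function, that maps the data to a uniform distribution with values between 0 and 1: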

>>> import numpy as np
>>> from sklearn import preprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> # Hold out a test set so the fitted transform can be checked on unseen data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> # Fit on the training data only, then reuse the fitted quantiles on the test data
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3,  5.1,  5.8,  6.5,  7.9])

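This feature is the sepal length in centimeters. Once the quantile transformation is applied, these landmarks closely approach the percentiles defined above: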

>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
array([ 0.00...,  0.24...,  0.49...,  0.73...,  0.99...])

This can be confirmed on an independent test set, which gives similar results:

>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
array([ 4.4  ,  5.125,  5.75 ,  6.175,  7.3  ])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
array([ 0.01...,  0.25...,  0.46...,  0.60...,  0.94...])
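To make the mechanism more concrete, here is a minimal NumPy-only sketch of mapping a feature to [0, 1] through its empirical quantiles. It is an illustration of the idea, not sklearn's actual implementation; the helper name uniform_quantile_map, the synthetic lognormal data and n_quantiles=100 are all assumptions made for this example.

```python
import numpy as np

def uniform_quantile_map(train_col, new_col, n_quantiles=100):
    """Hypothetical helper: map values to [0, 1] via the empirical CDF of train_col."""
    # Reference probabilities, evenly spaced between 0 and 1.
    references = np.linspace(0, 1, n_quantiles)
    # Feature values observed at those probabilities in the training data.
    quantiles = np.quantile(train_col, references)
    # Piecewise-linear lookup; np.interp clamps out-of-range inputs to 0 or 1.
    return np.interp(new_col, quantiles, references)

rng = np.random.RandomState(0)
skewed = rng.lognormal(size=1000)  # strongly right-skewed synthetic feature

mapped = uniform_quantile_map(skewed, skewed)
print(np.percentile(mapped, [0, 25, 50, 75, 100]))
```

The printed percentiles of the mapped feature should land close to 0, 0.25, 0.5, 0.75 and 1, mirroring what the iris example shows for QuantileTransformer.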

2 Mapping to a Gaussian Distribution

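In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to one that is as close to Gaussian as possible, in order to stabilize variance and minimize skewness.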

The PowerTransformer class currently provides two such power transforms: the Yeo-Johnson transform and the Box-Cox transform.

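Yeo-Johnson transform:

$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ x_i \geq 0 \\
\ln(x_i + 1) & \text{if } \lambda = 0,\ x_i \geq 0 \\
-\dfrac{(-x_i + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ x_i < 0 \\
-\ln(-x_i + 1) & \text{if } \lambda = 2,\ x_i < 0
\end{cases}
$$

Box-Cox transform:

$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(x_i) & \text{if } \lambda = 0
\end{cases}
$$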

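Box-Cox can only be applied to strictly positive data. In both methods the transformation is parameterized by λ, which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution: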

>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_lognormal                                         
array([[1.28..., 1.18..., 0.84...],
       [0.94..., 1.60..., 0.38...],
       [1.35..., 0.21..., 1.09...]])
>>> pt.fit_transform(X_lognormal)                   
array([[ 0.49...,  0.17..., -0.15...],
       [-0.05...,  0.58..., -0.57...],
       [ 0.69..., -0.84...,  0.10...]])
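The fitted parameter can be inspected after fitting. The following sketch uses its own synthetic lognormal data rather than the array above (an assumption made for this example). It reads the estimated λ from the lambdas_ attribute and re-applies the Box-Cox formula by hand to check that it reproduces the transformer's output.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive data (an assumption for this example).
rng = np.random.RandomState(0)
X = rng.lognormal(size=(200, 1))

pt = PowerTransformer(method='box-cox', standardize=False)
X_trans = pt.fit_transform(X)

# One lambda per feature, estimated by maximum likelihood during fit.
lam = pt.lambdas_[0]

# Box-Cox by hand: (x**lam - 1) / lam, or log(x) when lam is 0.
if lam != 0:
    manual = (X ** lam - 1) / lam
else:
    manual = np.log(X)

print("estimated lambda:", lam)
print("matches PowerTransformer:", np.allclose(manual, X_trans))
```

Because the data are lognormal, the estimated λ is expected to be close to 0, where Box-Cox reduces to a plain logarithm.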

The example above sets the standardize option to False. By default, however, PowerTransformer applies zero-mean, unit-variance normalization to the transformed output.
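The effect of the standardize option can be checked directly. This is a small sketch on synthetic lognormal data (the data and variable names are assumptions for illustration) that compares the column means and standard deviations with and without standardization.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(500, 2))  # synthetic, strictly positive data

raw = PowerTransformer(method='box-cox', standardize=False).fit_transform(X)
standardized = PowerTransformer(method='box-cox').fit_transform(X)  # standardize=True

print("standardize=False:", raw.mean(axis=0), raw.std(axis=0))
print("standardize=True: ", standardized.mean(axis=0), standardized.std(axis=0))

# The default output is the raw power-transformed data, centered and rescaled.
rescaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print("same up to centering and scaling:", np.allclose(rescaled, standardized))
```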

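Box-Cox and Yeo-Johnson can be applied to a variety of probability distributions. For some distributions the power transform produces results that look very Gaussian, but for others it is ineffective. This underlines the importance of visualizing the data before and after the transformation.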
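When a plot is not convenient, the sample skewness offers a rough numeric check. The sketch below is only a complement to visualization: it applies Yeo-Johnson to two synthetic right-skewed samples (the distributions and sample sizes are assumptions chosen for this example) and compares the skewness before and after.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
samples = {
    "lognormal": rng.lognormal(size=(1000, 1)),
    "chi-squared (df=3)": rng.chisquare(df=3, size=(1000, 1)),
}

skew_before, skew_after = {}, {}
for name, X in samples.items():
    X_trans = PowerTransformer(method='yeo-johnson').fit_transform(X)
    skew_before[name] = float(skew(X.ravel()))
    skew_after[name] = float(skew(X_trans.ravel()))
    print(f"{name}: skewness {skew_before[name]:.2f} -> {skew_after[name]:.2f}")
```

A skewness near 0 does not by itself show that the result is Gaussian (a bimodal sample can also be symmetric), so a histogram or Q-Q plot remains the more reliable check.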

QuantileTransformer can also map data to a normal distribution by setting output_distribution='normal'. Here is the result of applying it to the iris dataset:

>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       ...,
       [7.7, 4.1, 6.7, 2.5],
       [7.7, 4.2, 6.7, 2.5],
       [7.9, 4.4, 6.9, 2.5]])

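Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the minimum and maximum of the input (corresponding to the 1e-7 and 1 - 1e-7 quantiles, respectively) do not become infinite under the transformation.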
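The clipping can be observed on the iris data. In this sketch, n_quantiles=150 is an assumption chosen to match the number of samples, and scipy.stats.norm is used only to compute the reference quantiles of the standard normal distribution for comparison.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.preprocessing import QuantileTransformer

X = load_iris().data  # 150 samples, 4 features

# n_quantiles is set to the sample count to avoid the warning raised when
# the default of 1000 exceeds the number of samples.
qt = QuantileTransformer(n_quantiles=150, output_distribution='normal',
                         random_state=0)
X_trans = qt.fit_transform(X)

print("all finite:", np.isfinite(X_trans).all())
print("min / max:", X_trans.min(), X_trans.max())
print("normal quantiles at 1e-7 and 1 - 1e-7:",
      norm.ppf(1e-7), norm.ppf(1 - 1e-7))
print("median of each output column:", np.median(X_trans, axis=0))
```

The extreme outputs should sit at roughly the normal quantiles for 1e-7 and 1 - 1e-7 rather than at infinity. The column medians should be near 0, though ties in the iris measurements mean they need not be exactly 0.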

API

class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
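The handling of values outside the fitted range can be sketched as follows. The training data are synthetic and the two out-of-range query values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 1))  # synthetic training feature

qt = QuantileTransformer(n_quantiles=100, random_state=0).fit(X_train)

# Values far below and far above anything seen during fit.
X_new = np.array([[-50.0], [50.0]])
out = qt.transform(X_new)
print(out.ravel())
```

With the default uniform output, the two query values are mapped to the bounds of the output distribution, 0 and 1, instead of being extrapolated.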

Read more in the User Guide.

New in version 0.19.

Parameters

n_quantiles (int, default=1000 or n_samples): Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
output_distribution ({'uniform', 'normal'}, default='uniform'): Marginal distribution for the transformed data. The choices are 'uniform' (default) or 'normal'.
ignore_implicit_zeros (bool, default=False): Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
subsample (int, default=1e5): Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
random_state (int, RandomState instance or None, default=None): Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
copy (bool, default=True): Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

Attributes

n_quantiles_ (int): The actual number of quantiles used to discretize the cumulative distribution function.
quantiles_ (ndarray of shape (n_quantiles, n_features)): The values corresponding to the quantiles of reference.
references_ (ndarray of shape (n_quantiles,)): Quantiles of references.
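A short sketch of how these attributes relate to the n_quantiles parameter, assuming a synthetic dataset with only 25 samples. Requesting more quantiles than samples triggers a warning in sklearn, which is silenced here on purpose.

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.normal(size=(25, 2))  # only 25 samples, 2 features (synthetic)

with warnings.catch_warnings():
    # Asking for 1000 quantiles with 25 samples emits a warning.
    warnings.simplefilter("ignore")
    qt = QuantileTransformer(n_quantiles=1000, random_state=0).fit(X)

print("n_quantiles_:", qt.n_quantiles_)           # capped at the sample count
print("quantiles_ shape:", qt.quantiles_.shape)   # (n_quantiles_, n_features)
print("references_ range:", qt.references_[0], qt.references_[-1])
```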

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])

API

class sklearn.preprocessing.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.
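The difference in input requirements can be seen on a small made-up array that contains a negative value and a zero. This is a sketch only; the exact error message may vary between sklearn versions, so it is printed rather than compared.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up data containing a negative value and a zero.
X = np.array([[-1.5], [0.0], [0.7], [2.0], [3.5]])

# Yeo-Johnson accepts negative and zero values.
yj = PowerTransformer(method='yeo-johnson').fit_transform(X)
print("Yeo-Johnson output:", yj.ravel())

# Box-Cox rejects data that are not strictly positive.
box_cox_failed = False
try:
    PowerTransformer(method='box-cox').fit(X)
except ValueError as exc:
    box_cox_failed = True
    print("Box-Cox raised ValueError:", exc)
```

A common workaround for Box-Cox is to shift the data so that all values are positive, but the shift then becomes an extra choice that affects the fitted λ; Yeo-Johnson avoids that step.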
