
Machine Learning with sklearn (11): Data Preprocessing (6): Non-linear Transformation

Two types of transformation are available: quantile transforms and power transforms. Both are based on monotonic transformations of the features, and thus preserve the rank of the values within each feature.

By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods (a small sketch below illustrates this). It does, however, distort correlations and distances within and across features.

Power transforms, on the other hand, are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.
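Here is that sketch: a minimal illustration (with made-up toy data) contrasting a linear scaler with the rank-based quantile transform in the presence of a single extreme outlier:

import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

# Linear scaling: the outlier dominates, squeezing the other points together.
print(StandardScaler().fit_transform(X).ravel())

# Rank-based quantile transform: values spread evenly over [0, 1],
# regardless of the outlier's magnitude.
qt = QuantileTransformer(n_quantiles=5, random_state=0)
print(qt.fit_transform(X).ravel())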

1 Mapping to a uniform distribution

The QuantileTransformer class and the quantile_transform function provide a non-parametric transformation based on the quantile function to map the data to a uniform distribution on [0, 1]:

>>> import numpy as np
>>> from sklearn import preprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3,  5.1,  5.8,  6.5,  7.9])

This feature is the sepal length in centimeters. Once the quantile transform is applied, those landmarks approach closely the previously defined percentiles:

>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
...
array([0.00..., 0.24..., 0.49..., 0.73..., 0.99...])

This can be confirmed on an independent test set, with similar observations:

>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
...
array([ 4.4  ,  5.125,  5.75 ,  6.175,  7.3  ])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
...
array([ 0.01...,  0.25...,  0.46...,  0.60... ,  0.94...])

2 Mapping to a Gaussian distribution

In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.

PowerTransformer currently provides two such power transformations: the Yeo-Johnson transform and the Box-Cox transform.

The Yeo-Johnson transform is given by:

$$x_i^{(\lambda)} =
\begin{cases}
[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0,\; x_i \geq 0 \\
\ln(x_i + 1) & \text{if } \lambda = 0,\; x_i \geq 0 \\
-[(-x_i + 1)^{2-\lambda} - 1] / (2-\lambda) & \text{if } \lambda \neq 2,\; x_i < 0 \\
-\ln(-x_i + 1) & \text{if } \lambda = 2,\; x_i < 0
\end{cases}$$

while the Box-Cox transform is given by:

$$x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(x_i) & \text{if } \lambda = 0
\end{cases}$$
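As a quick sanity check of the Box-Cox formula, the following sketch implements it directly for a fixed lambda and compares the result against scipy.special.boxcox; the helper box_cox is purely illustrative and not part of sklearn (PowerTransformer additionally estimates lambda by maximum likelihood):

import numpy as np
from scipy.special import boxcox

def box_cox(x, lmbda):
    """Box-Cox transform of strictly positive x for a fixed lambda."""
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

x = np.array([0.5, 1.0, 2.0, 4.0])
print(np.allclose(box_cox(x, 0.5), boxcox(x, 0.5)))  # True
print(np.allclose(box_cox(x, 0.0), boxcox(x, 0.0)))  # True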

Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by lambda, which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution:

>>> import numpy as np
>>> from sklearn import preprocessing
>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_lognormal                                         
array([[1.28..., 1.18..., 0.84...],
       [0.94..., 1.60..., 0.38...],
       [1.35..., 0.21..., 1.09...]])
>>> pt.fit_transform(X_lognormal)                   
array([[ 0.49...,  0.17..., -0.15...],
       [-0.05...,  0.58..., -0.57...],
       [ 0.69..., -0.84...,  0.10...]])

While the above example sets the standardize option to False, PowerTransformer will by default apply zero-mean, unit-variance normalization to the transformed output.
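A minimal sketch of that default, regenerating the lognormal sample from above: with standardize=True the transformed columns come out with roughly zero mean and unit variance:

import numpy as np
from sklearn import preprocessing

X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
pt_std = preprocessing.PowerTransformer(method='box-cox')  # standardize=True
X_std = pt_std.fit_transform(X_lognormal)
print(X_std.mean(axis=0))  # approximately [0. 0. 0.]
print(X_std.std(axis=0))   # approximately [1. 1. 1.]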

When Box-Cox and Yeo-Johnson are applied to various probability distributions, the power transforms achieve very Gaussian-like results for some of them, but are ineffective for others. This highlights the importance of visualizing the data before and after the transformation.
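A minimal plotting sketch along those lines, assuming matplotlib is available (the choice of a chi-squared input is arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.chisquare(df=3, size=(1000, 1))  # right-skewed input
X_t = PowerTransformer(method='yeo-johnson').fit_transform(X)

# Histograms before and after: the transformed sample should look
# much closer to a bell curve for this distribution.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(X, bins=50)
ax1.set_title('original (chi-squared)')
ax2.hist(X_t, bins=50)
ax2.set_title('after Yeo-Johnson')
plt.show()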

It is also possible to map data to a normal distribution using QuantileTransformer by setting output_distribution='normal'. Continuing with the iris dataset from above:

>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       ...,
       [7.7, 4.1, 6.7, 2.5],
       [7.7, 4.2, 6.7, 2.5],
       [7.9, 4.4, 6.9, 2.5]])

Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input's minimum and maximum (which correspond to the 1e-7 and 1 - 1e-7 quantiles respectively) do not become infinite under the transformation.
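A small sketch of that clipping behaviour: feature values far outside the fitted range map to the finite bounds of the normal output (roughly ±5.2, the normal quantiles at 1e-7 and 1 - 1e-7) rather than to ±infinity:

import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X, y = load_iris(return_X_y=True)
qt = preprocessing.QuantileTransformer(output_distribution='normal',
                                       random_state=0).fit(X)

extreme = np.array([[0.0, 0.0, 0.0, 0.0],           # below the fitted range
                    [100.0, 100.0, 100.0, 100.0]])  # above the fitted range
print(qt.transform(extreme))  # finite values, roughly -5.2 and +5.2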

API

class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
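A minimal sketch of that two-step mapping for a 'normal' output distribution, using plain numpy/scipy; this illustrates the idea rather than sklearn's exact implementation (which additionally handles ties, sparse input, and bounds):

import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
x = rng.exponential(size=1000)

# Step 0: estimate the cumulative distribution function with quantile landmarks.
quantiles = np.percentile(x, np.linspace(0, 100, 100))
references = np.linspace(0, 1, 100)

# Step 1: map values to a uniform distribution via the empirical CDF.
u = np.interp(x, quantiles, references)

# Step 2: apply the quantile (ppf) function of the output distribution,
# clipping so the extremes stay finite.
z = norm.ppf(np.clip(u, 1e-7, 1 - 1e-7))
print(z.mean(), z.std())  # roughly 0 and 1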

Read more in the User Guide.

New in version 0.19.

Parameters
n_quantiles : int, default=1000 or n_samples

Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.

output_distribution : {'uniform', 'normal'}, default='uniform'

Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.

ignore_implicit_zeros : bool, default=False

Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.

subsample : int, default=1e5

Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.

random_state : int, RandomState instance or None, default=None

Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See the Glossary.

copy : bool, default=True

Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

Attributes
n_quantiles_ : int

The actual number of quantiles used to discretize the cumulative distribution function.

quantiles_ : ndarray of shape (n_quantiles, n_features)

The values corresponding to the quantiles of reference.

references_ : ndarray of shape (n_quantiles,)

Quantiles of references.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
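Continuing the snippet above: the fitted transformer also exposes inverse_transform, which maps transformed values back to the original feature scale (the round trip is approximate, since the CDF is a piecewise-linear estimate over n_quantiles landmarks):

>>> X_back = qt.inverse_transform(qt.transform(X))  # approximate round trip
>>> np.allclose(X, X_back, atol=0.1)
True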

API

class sklearn.preprocessing.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data (see the sketch below).

By default, zero-mean, unit-variance normalization is applied to the transformed data.
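A small sketch of that requirement: Yeo-Johnson accepts negative values, while fitting Box-Cox on the same data raises a ValueError:

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-2.0], [-1.0], [0.5], [3.0]])  # contains negative values
print(PowerTransformer(method='yeo-johnson').fit_transform(X).ravel())

try:
    PowerTransformer(method='box-cox').fit_transform(X)
except ValueError as exc:
    print('box-cox rejected non-positive data:', exc)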

Read more in the User Guide.

New in version 0.20.

Parameters
method : {'yeo-johnson', 'box-cox'}, default='yeo-johnson'

The power transform method. Available methods are:

  • 'yeo-johnson' [1], works with positive and negative values

  • 'box-cox' [2], only works with strictly positive values

standardize : bool, default=True

Set to True to apply zero-mean, unit-variance normalization to the transformed output.

copy : bool, default=True

Set to False to perform inplace computation during transformation.

Attributes
lambdas_ : ndarray of float of shape (n_features,)

The parameters of the power transformation for the selected features.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
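Continuing the snippet above: the fitted transformer also provides inverse_transform; since both power transforms are analytically invertible, the round trip recovers the input up to floating-point error:

>>> X_back = pt.inverse_transform(pt.transform(data))
>>> np.allclose(X_back, data)
True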