Machine Learning with sklearn (11): Data Processing (6) Non-linear Transformation
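Two types of transformation are available: quantile transforms and power transforms. Both are based on monotonic transformations of the features and therefore preserve the rank of the values along each feature.
By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.
Power transforms are a family of parametric transformations that aim to map data from any distribution to one as close to Gaussian as possible.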
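The rank-preservation claim can be checked directly. The following is a minimal sketch, assuming a synthetic log-normal feature and an illustrative `n_quantiles=200`; neither choice comes from the original article. It sorts the samples by their original value and verifies that the transformed values never decrease along that order.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Assumption: one synthetic, skewed, strictly positive feature.
rng = np.random.RandomState(0)
x = rng.lognormal(size=(200, 1))
order = np.argsort(x[:, 0])  # sample indices from smallest to largest value

transformers = {
    "quantile": QuantileTransformer(n_quantiles=200, random_state=0),
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),
    "box-cox": PowerTransformer(method="box-cox"),
}

rank_preserved = {}
for name, transformer in transformers.items():
    t = transformer.fit_transform(x)[:, 0]
    # A monotonic map never goes down when samples are visited in original order
    # (a tiny tolerance absorbs floating-point noise).
    rank_preserved[name] = bool(np.all(np.diff(t[order]) >= -1e-12))

print(rank_preserved)
```

Distances between samples are not preserved, only their ordering, which is the distortion the paragraph above warns about.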
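1 Mapping to a Uniform Distribution
The QuantileTransformer class and the quantile_transform function provide a non-parametric transformation, based on the quantile function, that maps the data to a uniform distribution with values between 0 and 1: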
>>> import numpy as np
>>> from sklearn import preprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3,  5.1,  5.8,  6.5,  7.9])
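This feature is the sepal length in centimetres. Once the quantile transformation has been applied, these landmarks lie close to the percentiles defined above: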
>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
array([0.00... , 0.24..., 0.49..., 0.73..., 0.99... ])
This can be confirmed on an independent test set, which gives similar results:
>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
array([ 4.4  ,  5.125,  5.75 ,  6.175,  7.3  ])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
array([ 0.01...,  0.25...,  0.46...,  0.60... ,  0.94...])
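The section names the quantile_transform function but only demonstrates the class. A minimal sketch of the function form follows, assuming an illustrative `n_quantiles=100` (chosen to stay below the 112 training samples, not taken from the article). It compares the function's output with the class on the same training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer, quantile_transform

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Class API: fit on the training split, then reuse the fitted quantiles on the test split.
qt = QuantileTransformer(n_quantiles=100, random_state=0)
via_class = qt.fit_transform(X_train)
X_test_trans = qt.transform(X_test)

# Function API: fits and transforms in a single call, keeping no fitted state.
via_function = quantile_transform(X_train, n_quantiles=100, random_state=0, copy=True)

same_result = bool(np.allclose(via_class, via_function))
print(same_result, X_test_trans.min(), X_test_trans.max())
```

Because the function keeps no fitted state, it cannot reapply the training quantiles to a held-out set; the class is usually the safer choice whenever a train/test split is involved.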
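2 Mapping to a Gaussian Distribution
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to one as close to Gaussian as possible, in order to stabilize variance and minimize skewness.
PowerTransformer currently provides two such power transforms: the Yeo-Johnson transform and the Box-Cox transform.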
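Yeo-Johnson transform:

$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ x_i \geq 0 \\
\ln(x_i + 1) & \text{if } \lambda = 0,\ x_i \geq 0 \\
-\dfrac{(-x_i + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ x_i < 0 \\
-\ln(-x_i + 1) & \text{if } \lambda = 2,\ x_i < 0
\end{cases}
$$

Box-Cox transform:

$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(x_i) & \text{if } \lambda = 0
\end{cases}
$$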
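As a sanity check on the two standard definitions (Box-Cox: (x^λ − 1)/λ for λ ≠ 0 and ln x for λ = 0; Yeo-Johnson: the same idea applied to x + 1 for non-negative values and mirrored with exponent 2 − λ for negative values), the sketch below writes them out in NumPy and compares them with PowerTransformer using `standardize=False`. The helper names `box_cox` and `yeo_johnson` and the synthetic data are assumptions made for illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

EPS = 1e-12  # threshold for treating lambda as exactly 0 or 2 (illustrative choice)

def box_cox(x, lam):
    """Hypothetical helper: Box-Cox for strictly positive 1-D x."""
    if abs(lam) < EPS:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def yeo_johnson(x, lam):
    """Hypothetical helper: Yeo-Johnson for 1-D x of any sign."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    if abs(lam) < EPS:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < EPS:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

rng = np.random.RandomState(0)
x_pos = rng.lognormal(size=200)            # strictly positive, suitable for Box-Cox
x_any = rng.gamma(2.0, size=200) - 1.0     # contains negative values, for Yeo-Johnson

bc = PowerTransformer(method="box-cox", standardize=False)
bc_out = bc.fit_transform(x_pos.reshape(-1, 1))[:, 0]
yj = PowerTransformer(method="yeo-johnson", standardize=False)
yj_out = yj.fit_transform(x_any.reshape(-1, 1))[:, 0]

# Re-apply the formulas with the lambdas that scikit-learn estimated.
bc_matches = bool(np.allclose(box_cox(x_pos, bc.lambdas_[0]), bc_out))
yj_matches = bool(np.allclose(yeo_johnson(x_any, yj.lambdas_[0]), yj_out))
print(bc_matches, yj_matches)
```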
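Box-Cox can only be applied to strictly positive data. In both methods the transformation is parameterized by λ, which is determined by maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a log-normal distribution to a normal distribution: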
>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_lognormal
array([[1.28..., 1.18..., 0.84...],
       [0.94..., 1.60..., 0.38...],
       [1.35..., 0.21..., 1.09...]])
>>> pt.fit_transform(X_lognormal)
array([[ 0.49...,  0.17..., -0.15...],
       [-0.05...,  0.58..., -0.57...],
       [ 0.69..., -0.84...,  0.10...]])
The example above sets the standardize option to False. By default, however, PowerTransformer applies zero-mean, unit-variance normalization to the transformed output.
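The effect of the standardize option can be sketched as follows. The sample size of 200 is an assumption for illustration; the three-row example above is too small for stable statistics. The sketch compares the default output with the result of applying StandardScaler to the `standardize=False` output.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Assumption: a larger log-normal sample than the 3x3 matrix used above.
rng = np.random.RandomState(616)
X = rng.lognormal(size=(200, 3))

raw = PowerTransformer(method="box-cox", standardize=False).fit_transform(X)
standardized = PowerTransformer(method="box-cox").fit_transform(X)  # standardize=True by default

# The default output has (numerically) zero mean and unit variance per feature.
col_means = standardized.mean(axis=0)
col_stds = standardized.std(axis=0)

# Scaling the raw power-transformed output separately gives the same numbers.
matches_scaler = bool(np.allclose(StandardScaler().fit_transform(raw), standardized))
print(col_means, col_stds, matches_scaler)
```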
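When Box-Cox and Yeo-Johnson are applied to a variety of probability distributions, the result looks very Gaussian for some of them, while for others the transforms are much less effective. This underlines the importance of visualizing the data before and after a power transform.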
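A rough numeric stand-in for such a comparison is sketched below. It is not the article's figure: the four synthetic distributions, the sample size, and the use of sample skewness as a summary are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
n = 1000

# Assumption: four strictly positive synthetic samples, so that Box-Cox is applicable to all.
samples = {
    "lognormal": rng.lognormal(size=n),
    "chi-squared (df=3)": rng.chisquare(df=3, size=n),
    "uniform": rng.uniform(low=1.0, high=2.0, size=n),
    "bimodal": np.concatenate([rng.normal(3, 0.3, n // 2), rng.normal(8, 0.3, n // 2)]),
}

# Maps (distribution, method) to (skewness before, skewness after).
skewness = {}
for name, x in samples.items():
    column = x.reshape(-1, 1)
    for method in ("box-cox", "yeo-johnson"):
        t = PowerTransformer(method=method).fit_transform(column)[:, 0]
        skewness[(name, method)] = (float(skew(x)), float(skew(t)))

for key, (before, after) in skewness.items():
    print(key, round(before, 3), round(after, 3))
```

Skewness alone cannot reveal that a bimodal sample remains bimodal after the transform, which is one reason a histogram or Q-Q plot is still worth drawing.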
Data can also be mapped to a normal distribution with QuantileTransformer by setting output_distribution='normal'. Here is the result of applying it to the iris dataset:
>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       ...,
       [7.7, 4.1, 6.7, 2.5],
       [7.7, 4.2, 6.7, 2.5],
       [7.9, 4.4, 6.9, 2.5]])
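As a result, the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input's minimum and maximum (which correspond to the 1e-7 and 1 - 1e-7 quantiles respectively) do not become infinite under the transformation.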
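Both statements can be checked on a small example. The sketch below assumes a synthetic exponential feature with 101 samples (an odd count, so that the median is itself a sample) and `n_quantiles=101`; these choices are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Assumption: a skewed synthetic feature with an odd number of samples.
rng = np.random.RandomState(0)
X = rng.exponential(size=(101, 1))

qt = QuantileTransformer(n_quantiles=101, output_distribution="normal", random_state=0)
X_trans = qt.fit_transform(X)

# The input median should land at (approximately) 0 in the output.
median_in = np.median(X)
median_mapped = qt.transform(np.array([[median_in]]))[0, 0]

# The extremes stay finite because the output is clipped near the 1e-7 quantiles.
extremes = qt.transform(np.array([[X.min()], [X.max()]]))[:, 0]
print(median_mapped, extremes)
```

The clipped extremes come out at roughly ±5.2, the standard normal quantile for a tail probability of 1e-7.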
API
class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)
Transform features using quantiles information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
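The two-step description above can be approximated by hand for the uniform case. This is a minimal sketch, assuming continuous synthetic data without ties and `n_quantiles` equal to the number of samples. It uses plain np.interp on the fitted `quantiles_` and `references_`, which is only an approximation of the idea; scikit-learn's internal implementation may differ in details such as tie handling.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 1))
X_new = rng.normal(size=(20, 1))

qt = QuantileTransformer(n_quantiles=100, random_state=0).fit(X_train)

# Empirical CDF by piecewise-linear interpolation between the fitted landmarks:
# quantiles_ holds feature values, references_ the matching probabilities in [0, 1].
manual = np.interp(X_new[:, 0], qt.quantiles_[:, 0], qt.references_)
library = qt.transform(X_new)[:, 0]
close_to_library = bool(np.allclose(manual, library, atol=1e-6))

# Values outside the fitted range are mapped to the bounds of the output distribution.
out_of_range = np.array([[X_train.min() - 10.0], [X_train.max() + 10.0]])
bounds = qt.transform(out_of_range)[:, 0]
print(close_to_library, bounds)
```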
Read more in the User Guide.
New in version 0.19.
Parameters
- n_quantiles : int, default=1000 or n_samples
  Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
- output_distribution : {'uniform', 'normal'}, default='uniform'
  Marginal distribution for the transformed data. The choices are 'uniform' (default) or 'normal'.
- ignore_implicit_zeros : bool, default=False
  Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
- subsample : int, default=1e5
  Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
- random_state : int, RandomState instance or None, default=None
  Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
- copy : bool, default=True
  Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
Attributes
- n_quantiles_ : int
  The actual number of quantiles used to discretize the cumulative distribution function.
- quantiles_ : ndarray of shape (n_quantiles, n_features)
  The values corresponding to the quantiles of reference.
- references_ : ndarray of shape (n_quantiles, )
  Quantiles of references.
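The relationship between the n_quantiles parameter and the fitted attributes can be inspected directly. A minimal sketch, assuming a synthetic dataset of 50 samples and 2 features so that the default n_quantiles=1000 exceeds the sample count:

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))

with warnings.catch_warnings():
    # Requesting more quantiles than samples triggers a warning; silence it for the demo.
    warnings.simplefilter("ignore")
    qt = QuantileTransformer(random_state=0).fit(X)

# n_quantiles_ is capped at the number of samples, and the arrays are sized accordingly.
fitted = (qt.n_quantiles_, qt.quantiles_.shape, qt.references_.shape)
print(fitted)
```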
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
API
class sklearn.preprocessing.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
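The difference in input requirements can be seen on a small array containing zero and a negative value. This is a minimal sketch with made-up data; it records whether fitting Box-Cox raises a ValueError and then fits Yeo-Johnson on the same input.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Assumption: a tiny illustrative feature containing a negative value and a zero.
X = np.array([[-1.5], [0.0], [0.7], [2.3], [4.1]])

try:
    PowerTransformer(method="box-cox").fit(X)
    box_cox_error = None
except ValueError as exc:
    # Box-Cox rejects data that is not strictly positive.
    box_cox_error = str(exc)

# Yeo-Johnson accepts the same data without complaint.
yj_out = PowerTransformer(method="yeo-johnson").fit_transform(X)
print(box_cox_error)
print(yj_out.ravel())
```

If Box-Cox is required for data with non-positive values, shifting the feature by a constant is a common workaround, though the choice of shift then influences the fitted λ.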
By default, zero-mean, unit-variance normalization is applied to the transformed data.
Read more in the User Guide.
New in version 0.20.
Parameters
- method : {'yeo-johnson', 'box-cox'}, default='yeo-johnson'
  The power transform method. Available methods are 'yeo-johnson', which works with both positive and negative values, and 'box-cox', which only works with strictly positive values.
- standardize : bool, default=True
  Set to True to apply zero-mean, unit-variance normalization to the transformed output.
- copy : bool, default=True
  Set to False to perform inplace computation during transformation.
Attributes
- lambdas_ : ndarray of float of shape (n_features,)
  The parameters of the power transformation for the selected features.
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]