
Section 18: Using Batch Normalization in TensorFlow


The concept of batch normalization was introduced earlier in the deep learning chapter; for details, see Section 9, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Part 3).

In a deep network, the activation distributions differ from layer to layer and keep shifting during training, which can push activations into saturation. Batch normalization was proposed to alleviate this problem. In practice it also makes convergence much faster and improves generalization considerably; in some cases it can even replace regularization and dropout entirely.

I. Batch Normalization Functions

The normalization algorithm can be described as:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma\hat{x} + \beta$$

where $\mu$ and $\sigma^2$ are the sample mean and variance, $\gamma$ is the scale, $\beta$ is the offset, and $\epsilon$ is a small constant that keeps the denominator away from zero.

1. TensorFlow's built-in BN function, tf.nn.batch_normalization, is defined as follows:

def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):

The parameters are as follows:

  • x: an input tensor of arbitrary dimensions.
  • mean: the sample mean.
  • variance: the sample variance.
  • offset: the offset, i.e. a value added after normalization; the beta in the formula above.
  • scale: the scale, i.e. a value multiplied in after normalization; the gamma in the formula above.
  • variance_epsilon: a tiny value added to the denominator to avoid division by zero; the default is fine.
  • name: the name of the operation.

To use this function, it must be paired with another one, tf.nn.moments, which computes the mean and variance; with those in hand, BN can be applied (see the sketch after the parameter list below).

2. The tf.nn.moments() function is defined as follows:

def moments(x, axes, shift=None, name=None, keep_dims=False):
  • x: the input tensor.
  • axes: the axes along which to compute the mean and variance.
  • shift: A `Tensor` containing the value by which to shift the data for numerical stability, or `None` in which case the true mean of the data is used as shift. A shift close to the true mean provides the most numerically stable results.
  • name: the name of the operation.
  • keep_dims: whether to keep the reduced dimensions, i.e. whether the output has the same shape as the input.
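
Putting the two together, here is a minimal sketch (TensorFlow 1.x; the input shape and feature size are hypothetical) that normalizes the output of a fully connected layer:

import tensorflow as tf

# Hypothetical input: a batch of 128-dimensional feature vectors.
x = tf.placeholder(tf.float32, [None, 128])

# Learnable offset (beta) and scale (gamma), one per feature.
beta = tf.Variable(tf.zeros([128]))
gamma = tf.Variable(tf.ones([128]))

# Per-batch mean and variance, computed over the batch axis (axis 0).
mean, variance = tf.nn.moments(x, axes=[0])

# y = gamma * (x - mean) / sqrt(variance + epsilon) + beta
y = tf.nn.batch_normalization(x, mean, variance,
                              offset=beta, scale=gamma,
                              variance_epsilon=1e-3)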

These two functions alone are still not enough. For better results, we want to smooth the per-batch means and variances with an exponentially weighted average, which is what the tf.train.ExponentialMovingAverage class is for: it lets the previous value exert a decayed influence on the current one, so that the sequence of values becomes relatively smooth. For details, see Section 8, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization (Part 2).

Its behavior can be expressed with a single formula:

shadow_variable = decay * shadow_variable + (1 - decay) * variable

The terms are as follows:

  • decay: the decay rate, specified when constructing ExponentialMovingAverage(); typically 0.9.
  • variable: the value from the current batch.
  • shadow_variable on the right-hand side: the running value from the previous step.
  • shadow_variable on the left-hand side: the updated running value for the current step.

One way to think about shadow_variable is as an average over roughly the last 1/(1 - decay) values: the current sample carries weight (1 - decay), the previous one (1 - decay)·decay, the one before that (1 - decay)·decay², and so on.
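
As a sanity check, here is a plain-Python sketch (with made-up values) that unrolls the recursion and verifies the weights described above:

decay = 0.9
values = [1.0, 2.0, 3.0, 4.0]    # hypothetical per-batch statistics

# Apply the recursion step by step.
shadow = values[0]               # initialize with the first value
for v in values[1:]:
    shadow = decay * shadow + (1 - decay) * v

# The same result as an explicit weighted sum: the newest value has weight
# (1 - decay), the one before it (1 - decay) * decay, and so on; the initial
# value keeps weight decay ** (number of updates).
weights = [(1 - decay) * decay ** k for k in range(len(values) - 1)]
unrolled = sum(w * v for w, v in zip(weights, reversed(values[1:])))
unrolled += decay ** (len(values) - 1) * values[0]
assert abs(shadow - unrolled) < 1e-12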

3. The tf.train.ExponentialMovingAverage class is defined as follows:

def __init__(self, decay, num_updates=None, zero_debias=False,
             name="ExponentialMovingAverage"):

def apply(self, var_list=None):

The parameters are as follows:

  • decay: Float. The decay to use.
  • num_updates: Optional count of the number of updates applied to variables. When given, the actual decay rate used is `min(decay, (1 + num_updates) / (10 + num_updates))`.
  • zero_debias: If `True`, zero-debias moving averages that are initialized with tensors.
  • name: String. Optional prefix name to use for the name of ops added in.
  • var_list (argument of apply()): A list of Variable or Tensor objects. The variables and tensors must be of type float16, float32, or float64.

Calling apply() creates the shadow variables and returns an op that updates the exponential moving averages, as in the sketch below.
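
A minimal sketch (TensorFlow 1.x; shapes are hypothetical) of maintaining moving averages of the batch statistics:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])        # hypothetical input
batch_mean, batch_var = tf.nn.moments(x, axes=[0])

ema = tf.train.ExponentialMovingAverage(decay=0.9)
ema_apply_op = ema.apply([batch_mean, batch_var])  # op that updates the shadows

# During training, run the update op alongside the statistics BN consumes.
with tf.control_dependencies([ema_apply_op]):
    train_mean = tf.identity(batch_mean)
    train_var = tf.identity(batch_var)

# At inference time, read the smoothed statistics instead of the batch ones.
inference_mean = ema.average(batch_mean)
inference_var = ema.average(batch_var)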

II. Simple Usage of Batch Normalization

Although the functions above do not have many parameters, several of them must be combined to get anything done. TensorFlow's layers module therefore implements BN once more, bundling these functions into a single one that is much simpler to use. To use it, add the import:

from tensorflow.contrib.layers.python.layers import batch_norm

or call tf.contrib.layers.batch_norm() directly. The function is defined as follows:

def batch_norm(inputs,
               decay=0.999,
               center=True,
               scale=False,
               epsilon=0.001,
               activation_fn=None,
               param_initializers=None,
               param_regularizers=None,
               updates_collections=ops.GraphKeys.UPDATE_OPS,
               is_training=True,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               batch_weights=None,
               fused=False,
               data_format=DATA_FORMAT_NHWC,
               zero_debias_moving_mean=False,
               scope=None,
               renorm=False,
               renorm_clipping=None,
               renorm_decay=0.99):

The parameters are as follows:

  • inputs: A tensor with 2 or more dimensions, where the first dimension is `batch_size`. The normalization is over all but the last dimension if `data_format` is `NHWC`, and over all but the second dimension if it is `NCHW`.
  • decay: Decay for the moving average. Reasonable values for `decay` are close to 1.0, typically in the multiple-nines range: 0.999, 0.99, 0.9, etc. This is the decay of the exponentially weighted average used to update the mean and variance. It is usually set around 0.9: too small a value makes the statistics update too quickly, while a value too close to 1 gives almost no decay. If the model trains well but performs poorly on validation and/or test data, lower `decay` (e.g. to 0.9) and try `zero_debias_moving_mean=True` for improved stability.
  • center: If True, add the offset `beta` to the normalized tensor. If False, `beta` is ignored.
  • scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the next layer is linear (also e.g. `nn.relu`), this can be disabled, since the scaling can be done by the next layer. Because BN is usually followed by such a transformation, `scale` is generally left as False.
  • epsilon: Small float added to the variance to avoid dividing by zero; the default is fine.
  • activation_fn: Activation function. Defaults to None, which skips it and keeps a linear activation.
  • param_initializers: Optional initializers for beta, gamma, moving mean, and moving variance.
  • param_regularizers: Optional regularizers for beta and gamma.
  • updates_collections: Collections to collect the update ops for computation; these update ops need to be executed together with the train_op. The default is tf.GraphKeys.UPDATE_OPS, a built-in mechanism that updates the mean and variance only after the current batch has finished training, so each batch always uses the statistics from the previous step. Setting it to None instead adds a control dependency so the updates are computed in place; this is slightly slower than the default, but often helps training considerably (see the note after this list).
  • is_training: Whether or not the layer is in training mode. In training mode it would accumulate the statistics of the moments into `moving_mean` and `moving_variance` using an exponential moving average with the given `decay`. When it is not in training mode then it would use the values of the `moving_mean` and the `moving_variance`.
  • reuse: Whether or not the layer and its variables should be reused. To be able to reuse the layer, `scope` must be given; this supports variable sharing together with the `scope` parameter below.
  • variables_collections: Optional collections for the variables.
  • outputs_collections: Collections to add the outputs.
  • trainable: If `True` also add variables to the graph collection `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
  • batch_weights: An optional tensor of shape `[batch_size]`, containing a frequency weight for each batch item. If present, then the batch normalization uses weighted mean and variance. (This can be used to correct for bias in training example selection.)
  • fused: Use nn.fused_batch_norm if True, nn.batch_normalization otherwise.
  • data_format: A string. `NHWC` (default) and `NCHW` are supported.
  • zero_debias_moving_mean: Use zero_debias for moving_mean. It creates a new pair of variables `moving_mean/biased` and `moving_mean/local_step`.
  • scope: Optional scope for `variable_scope`; specifies the variable scope.
  • renorm: Whether to use Batch Renormalization (https://arxiv.org/abs/1702.03275). This adds extra variables during training. The inference is the same for either value of this parameter.
  • renorm_clipping: A dictionary that may map keys `rmax`, `rmin`, `dmax` to scalar `Tensors` used to clip the renorm correction. The correction `(r, d)` is used as `corrected_value = normalized_value * r + d`, with `r` clipped to [rmin, rmax], and `d` to [-dmax, dmax]. Missing rmax, rmin, dmax are set to inf, 0, inf, respectively.
  • renorm_decay: Momentum used to update the moving means and standard deviations with renorm. Unlike `momentum`, this affects training and should be neither too small (which would add noise) nor too large (which would give stale estimates). Note that `decay` is still applied to get the means and variances for inference.
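
To tie the parameters together, here is a minimal sketch (TensorFlow 1.x / contrib; the shapes and the scope name are hypothetical) following the recommendations above: decay around 0.9, scale left off before a ReLU, and updates_collections=None so the statistics are updated in place:

import tensorflow as tf
from tensorflow.contrib.layers.python.layers import batch_norm

def bn_layer(x, is_training):
    return batch_norm(x,
                      decay=0.9,                 # EMA decay for mean/variance
                      epsilon=1e-5,
                      scale=False,               # the following ReLU handles scaling
                      updates_collections=None,  # update the statistics in place
                      is_training=is_training,
                      scope='bn')                # hypothetical scope name

# Hypothetical NHWC feature map: batch of 28x28 maps with 32 channels.
x = tf.placeholder(tf.float32, [None, 28, 28, 32])
h = tf.nn.relu(bn_layer(x, is_training=True))   # set is_training=False at test time

If you keep the default updates_collections instead, the update ops must be run together with the training op yourself, e.g.:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # hypothetical loss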
