[Model Inference] Quantization Implementation, Part 2: A Detailed Look at Implementing KL Symmetric Quantization

阿新 • Published: 2021-12-17


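Hi everyone, this is 極智視界. This article walks through an implementation of the KL symmetric quantization algorithm, using Tengine's implementation as the example.

I previously wrote "[Model Inference] Quantization Implementation, Part 1: A Detailed Look at Implementing min-max Symmetric Quantization"; have a look if you are interested. This article is its sequel and the second in the series on quantization implementations.

I will skip the background on quantization, since earlier articles covered it at length, and get straight to it.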

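1. KL quantization principles

KL quantization uses the KL divergence to measure how similar the quantized data distribution is to the real data distribution. It is the strategy NVIDIA TensorRT uses for quantizing activations. Its main logic is as follows:

Unlike MIN-MAX, KL quantization does not map [min, max] directly onto [-127, 127]. Instead it searches for a threshold |T| < max(|max|, |min|) and maps [-T, T] onto [-127, 127]. The premise is that, with a well-chosen threshold, the values outside it can be discarded without much loss of accuracy;
Values beyond the threshold ±|T| are mapped to the threshold itself. The three red points in the figure above, for example, are mapped straight to -127. This kind of mapping is called saturating.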
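The saturating mapping described above can be sketched in a few lines. This is a minimal illustration, not Tengine code: the helper name quantize_saturating and its parameters are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not from Tengine): quantize x to int8 with a
// symmetric threshold T, saturating anything outside [-T, T].
std::int8_t quantize_saturating(float x, float T) {
    assert(T > 0.0f);
    const float scale = T / 127.0f;         // real-valued width of one int8 step
    const float q = std::round(x / scale);  // nearest integer level
    // Values beyond the threshold collapse onto -127 or 127.
    const float clamped = std::min(127.0f, std::max(-127.0f, q));
    return static_cast<std::int8_t>(clamped);
}
```

Any input with |x| > T ends up at ±127, which is exactly the information a well-chosen threshold is supposed to make cheap to lose.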

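KL quantization treats the float32 values and the int8 values as two distributions, updates both using the threshold |T|, and measures their similarity with the KL divergence. The smaller the KL divergence, the more similar the two distributions, and so the better the choice of |T|. For symmetric quantization, the Scale follows directly from this threshold, and Zero_point is always zero.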
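For reference, the quantities in this paragraph can be written out as follows. The bin index i and the symbols p_i and q_i (the normalized bin values of the reference distribution P and the quantized distribution Q) are notation introduced here, not taken from the Tengine source.

```latex
D_{KL}(P \,\|\, Q) = \sum_{i} p_i \log \frac{p_i}{q_i},
\qquad
\mathrm{Scale} = \frac{|T|}{127},
\qquad
\mathrm{Zero\_point} = 0
```

The search picks the |T| whose Q gives the smallest D_KL against P.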

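The figure below shows the pseudocode for KL-divergence calibration in TensorRT, which neatly sums up the whole KLD quantization procedure. (I will call it Figure 2 and refer back to it later.)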

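2. KL quantization implementation

As before, I use the KL quantization implementation in Tengine for the walkthrough.

The main steps are:

(1) Activation quantization: first find min and max, then run the KL search to generate the activation calibration table. fp32 to int8;

(2) Weight quantization: uses the min-max strategy. fp32 to int8;

(3) Bias quantization: reuses the activation scale for int32 quantization. fp32 to int32;

Weight and bias quantization involve one more step than activation quantization: besides computing the Scale, the Scale must also be applied to the values directly in order to produce the int8 tmfile.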
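Steps (2) and (3) can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the weight scale is per tensor for brevity, and the bias scale is taken as act_scale * w_scale, which is a common convention assumed here rather than something the excerpts in this article confirm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: min-max symmetric scale, max|w| maps to 127.
float weight_scale_minmax(const std::vector<float>& w) {
    assert(!w.empty());
    float abs_max = 0.0f;
    for (float v : w) abs_max = std::max(abs_max, std::fabs(v));
    return abs_max / 127.0f;
}

// Hypothetical helper: quantize one bias value to int32.
// Assumption: the bias scale is the product of activation and weight scales.
std::int32_t quantize_bias(float bias, float act_scale, float w_scale) {
    const float bias_scale = act_scale * w_scale;
    assert(bias_scale > 0.0f);
    return static_cast<std::int32_t>(std::round(bias / bias_scale));
}
```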

The main code implementing KL quantization in Tengine is:

case ALGORITHM_KL: {
  // No scale file supplied: run activation calibration and write table_kl.scale
  if (quant_tool.scale_file.empty()) {
    quant_tool.scale_file = "table_kl.scale";
    quant_tool.activation_quant_tool();
  }
  // Quantize weights and biases and write the int8 tmfile
  save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
  /* Evaluate quantitative losses */
  if (quant_tool.evaluate) {
    fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
    quant_tool.assess_quant_loss(0);
  }
  break;
}

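The two key interfaces here are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For KL quantization they do two things respectively:

(1) Activation quantization, producing table_kl.scale;

(2) Weight & bias quantization, producing scale_weight.txt, scale_bias.txt and the int8 tmfile;

The min/max computation in activation quantization and the whole weight & bias quantization process are the same for KL and MIN-MAX and share the same code, so I will not go through them here. If you are interested, see "[Model Inference] Quantization Implementation, Part 1: A Detailed Look at Implementing min-max Symmetric Quantization". This article focuses on the KL search strategy in activation quantization.

The entry point of the KL search strategy is: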

quant_tool.activation_quant_tool();

It first searches for min and max by comparison, mainly using std::max_element and std::min_element, which I will not dwell on. Once min and max are known, the KL search begins.
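As a rough idea of what that min/max step amounts to for one tensor, here is a minimal sketch. The helper abs_max_of is hypothetical and works on a std::vector for simplicity; Tengine's own code operates on its tensor buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: the largest absolute activation in one tensor,
// found with std::max_element and std::min_element.
float abs_max_of(const std::vector<float>& data) {
    assert(!data.empty());
    const float max_v = *std::max_element(data.begin(), data.end());
    const float min_v = *std::min_element(data.begin(), data.end());
    return std::max(std::fabs(max_v), std::fabs(min_v));
}
```

This value plays the role of step_max in the histogram code: it fixes the range that the 2048 bins have to cover.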

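2.1 Building the probability histogram

The first round builds the probability histogram and runs the first round of the KL computation. From the second round on, the histogram is not rebuilt; the new counts are accumulated onto the histogram built in the first round. So the more calibration images you use, the closer the final histogram comes to the real distribution.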


/* calculate hist */
uint32_t inum = 0;
for (int i = 0; i < ir_graph->tensor_num; i++) {
  struct tensor* ir_tensor = ir_graph->tensor_list[i];
  if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT) {
    // Largest absolute activation recorded for this tensor
    float step_max = std::abs(max_activation[i]);
    if (std::abs(min_activation[i]) > step_max)
      step_max = std::abs(min_activation[i]);
    float step_bin = step_max / 2048.0f;

    std::vector<float> every_edge;
    if (nums == imgs_list.size() - 1) {
      // First round: record the bin centres and create the histogram
      for (int j = 0; j < 2048; j++) {
        float edge_float = (step_bin * (j + 0.5f));
        every_edge.push_back(edge_float);
      }
      hist_edge.push_back(every_edge);
      hist_gram.push_back(histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max));
    }
    else {
      // Later rounds: accumulate counts onto the existing histogram
      std::vector<uint32_t> hist_tmp;
      hist_tmp = histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max);
      for (int j = 0; j < 2048; j++) {
        hist_gram[inum][j] += hist_tmp[j];
      }
    }
    // Map between tensor index and histogram index, in both directions
    tensor_hist[i] = inum;
    hist_tensor[inum] = i;
    inum++;
  }
}

Now look at the histCount interface:

std::vector<uint32_t> histCount(float* data, uint32_t elem_num, float abs_max) {
  // 2048 bins indexed 0..2047, so a value equal to abs_max lands in the last bin
  float bin_scale = abs_max / 2047.f;
  int bin_zp = 0;
  std::vector<uint32_t> hist(2048);
  for (uint32_t i = 0; i < elem_num; i++) {
    // Exact zeros are not counted
    if (data[i] != 0) {
      uint32_t hist_idx = round(std::abs(data[i]) / bin_scale);
      hist[hist_idx]++;
    }
  }
  return hist;
}

Finally, the resulting histogram is normalized:

distribution = normalize_histogram(distribution_in);

The histogram normalization interface is also simple:

std::vector<float> normalize_histogram(std::vector<uint32_t>& histogram) {
  std::vector<float> histogram_out(histogram.size());
  const size_t length = histogram.size();
  float sum = 0;
  // Bin 0 is left out of the sum and stays 0 in the output
  for (size_t i = 1; i < length; i++)
    sum += histogram[i];

  // Divide each remaining bin by the total so the output sums to 1
  for (size_t i = 1; i < length; i++)
    histogram_out[i] = float(histogram[i] / sum);

  return histogram_out;
}

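2.2 Computing P

For the logic that follows, look back at Figure 2: compute P first, then Q, and finally the KL divergence.

First compute the simulated-quantization distribution P. The search sweeps upward from 128 to 2048, and any overflow is folded into the edge bin. You can think of P as the fp32 data distribution before quantization, i.e. the real distribution. In the code, threshold is the number of histogram bins currently kept, and target_bin is the number of quantized bins they are merged into: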

// get P
fill(quantize_distribution.begin(), quantize_distribution.end(), 0.0f);
// Number of source histogram bins covered by one quantized bin (may be fractional)
const float num_per_bin = static_cast<float>(threshold) / static_cast<float>(target_bin);

for (int i = 0; i < target_bin; i++) {
  // Quantized bin i covers the source range [start, end)
  const float start = static_cast<float>(i) * num_per_bin;
  const float end = start + num_per_bin;

  // Partial overlap with the source bin at the left edge
  const int left_upper = static_cast<int>(ceil(start));
  if (static_cast<float>(left_upper) > start) {
    const float left_scale = static_cast<float>(left_upper) - start;
    quantize_distribution[i] += left_scale * distribution[left_upper - 1];
  }

  // Partial overlap with the source bin at the right edge
  const int right_lower = static_cast<int>(floor(end));
  if (static_cast<float>(right_lower) < end) {
    const float right_scale = end - static_cast<float>(right_lower);
    quantize_distribution[i] += right_scale * distribution[right_lower];
  }

  // Source bins lying entirely inside [start, end)
  for (int j = left_upper; j < right_lower; j++) {
    quantize_distribution[i] += distribution[j];
  }
}

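2.2 Computing Q

Then compute the actual quantized distribution Q, alongside P over the same sweep from 128 to 2048. You can think of Q as the int8 data distribution after quantization, i.e. the quantized distribution. The code spreads the mass of each merged bin back over the non-empty source bins it came from, so the result again has threshold entries and can be compared bin by bin: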

// get Q
std::vector<float> expand_distribution(threshold, 0);
for (int i = 0; i < target_bin; i++) {
  const float start = static_cast<float>(i) * num_per_bin;
  const float end = start + num_per_bin;
  // Weighted number of non-empty source bins inside quantized bin i
  float count = 0;

  const int left_upper = static_cast<int>(ceil(start));
  float left_scale = 0;
  if (static_cast<float>(left_upper) > start) {
    left_scale = static_cast<float>(left_upper) - start;
    if (distribution[left_upper - 1] != 0) {
      count += left_scale;
    }
  }

  const int right_lower = static_cast<int>(floor(end));
  float right_scale = 0;
  if (static_cast<float>(right_lower) < end) {
    right_scale = end - static_cast<float>(right_lower);
    if (distribution[right_lower] != 0) {
      count += right_scale;
    }
  }

  for (int j = left_upper; j < right_lower; j++) {
    if (distribution[j] != 0) {
      count++;
    }
  }

  // Average mass per non-empty source bin; when count is 0 the value is
  // never used, because every write below is guarded by distribution[...] != 0
  const float expand_value = quantize_distribution[i] / count;

  // Spread the mass back, skipping source bins that were empty
  if (static_cast<float>(left_upper) > start) {
    if (distribution[left_upper - 1] != 0) {
      expand_distribution[left_upper - 1] += expand_value * left_scale;
    }
  }
  if (static_cast<float>(right_lower) < end) {
    if (distribution[right_lower] != 0) {
      expand_distribution[right_lower] += expand_value * right_scale;
    }
  }
  for (int j = left_upper; j < right_lower; j++) {
    if (distribution[j] != 0) {
      expand_distribution[j] += expand_value;
    }
  }
}
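To make the merge-then-expand idea easier to follow, here is a simplified sketch. It assumes the number of source bins is an exact multiple of target_bin, so the fractional left_scale and right_scale handling of the Tengine code is not needed. The function name merge_and_expand is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch, not Tengine's code: assumes dist.size() is an exact
// multiple of target_bin, so no source bin straddles two quantized bins.
std::vector<float> merge_and_expand(const std::vector<float>& dist, std::size_t target_bin) {
    assert(target_bin > 0 && dist.size() % target_bin == 0);
    const std::size_t per_bin = dist.size() / target_bin;
    std::vector<float> expanded(dist.size(), 0.0f);
    for (std::size_t i = 0; i < target_bin; ++i) {
        const std::size_t begin = i * per_bin;
        float merged = 0.0f;      // mass of quantized bin i
        std::size_t nonzero = 0;  // non-empty source bins feeding it
        for (std::size_t j = begin; j < begin + per_bin; ++j) {
            merged += dist[j];
            if (dist[j] != 0.0f) ++nonzero;
        }
        if (nonzero == 0) continue;  // nothing to spread back
        const float expand_value = merged / static_cast<float>(nonzero);
        for (std::size_t j = begin; j < begin + per_bin; ++j) {
            if (dist[j] != 0.0f) expanded[j] = expand_value;  // empty bins stay 0
        }
    }
    return expanded;
}
```

With dist = {1, 0, 3, 4, 0, 0, 2, 2} and target_bin = 4, the pairs are merged to {1, 7, 0, 4} and expanded back to {1, 0, 3.5, 3.5, 0, 0, 2, 2}: the pair (3, 4) loses its internal detail, which is the information loss the KL divergence then measures.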

2.3 Computing the KL divergence

Next, compute the KL divergence between the real distribution P and the quantized distribution Q:

const float kl_divergence = compute_kl_divergence(t_distribution, expand_distribution);

The interface that computes the KL divergence is also simple:

float compute_kl_divergence(std::vector<float>& dist_a, std::vector<float>& dist_b) {
  const size_t length = dist_a.size();
  float result = 0;

  for (size_t i = 0; i < length; i++) {
    // Bins where dist_a is 0 contribute nothing
    if (dist_a[i] != 0) {
      if (dist_b[i] == 0) {
        // dist_b has no mass where dist_a does: add a fixed penalty of 1
        result += 1;
      }
      else {
        result += dist_a[i] * log(dist_a[i] / dist_b[i]);
      }
    }
  }
  return result;
}

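Ultimately we want the target_bin that minimizes the KL divergence (the code records it as threshold). Because the search runs inside the 128 --> 2048 loop, it can be written like this: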

// the best num of bin
if (kl_divergence < min_kl_divergence) {
  min_kl_divergence = kl_divergence;
  target_threshold = threshold;
}

That gives us the long-sought target_bin, which is the target_threshold here.
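Putting the pieces together, the whole search can be sketched as one loop. This is a simplified variant and not Tengine's code: groups are split on whole bin indices instead of fractional overlaps, both distributions are renormalized before comparison, and an empty Q bin is floored at a small epsilon instead of adding a penalty of 1. All names (normalize, kl, search_threshold) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Scale v in place so it sums to 1 (left unchanged if the sum is 0).
void normalize(std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    if (sum > 0.0) {
        for (double& x : v) x /= sum;
    }
}

// KL(P || Q) over bins with p > 0; q is floored at a small epsilon.
double kl(const std::vector<double>& p, const std::vector<double>& q) {
    double result = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) result += p[i] * std::log(p[i] / std::max(q[i], 1e-12));
    }
    return result;
}

// Returns the number of leading histogram bins to keep (the threshold)
// that gives the smallest KL divergence.
std::size_t search_threshold(const std::vector<double>& hist, std::size_t target_bin) {
    assert(target_bin > 0 && hist.size() >= target_bin);
    double best_kl = std::numeric_limits<double>::max();
    std::size_t best = hist.size();
    for (std::size_t threshold = target_bin; threshold <= hist.size(); ++threshold) {
        // P: the first `threshold` bins, with the clipped tail folded into the last one.
        std::vector<double> p(hist.begin(), hist.begin() + threshold);
        for (std::size_t i = threshold; i < hist.size(); ++i) p[threshold - 1] += hist[i];
        // Q: merge into target_bin groups, then spread each group's mass
        // evenly over its non-empty source bins.
        std::vector<double> q(threshold, 0.0);
        for (std::size_t g = 0; g < target_bin; ++g) {
            const std::size_t begin = g * threshold / target_bin;
            const std::size_t end = (g + 1) * threshold / target_bin;
            double mass = 0.0;
            std::size_t nonzero = 0;
            for (std::size_t i = begin; i < end; ++i) {
                mass += hist[i];
                if (hist[i] > 0.0) ++nonzero;
            }
            for (std::size_t i = begin; i < end; ++i) {
                if (hist[i] > 0.0) q[i] = mass / static_cast<double>(nonzero);
            }
        }
        normalize(p);
        normalize(q);
        const double d = kl(p, q);
        if (d < best_kl) {
            best_kl = d;
            best = threshold;
        }
    }
    return best;
}
```

For a perfectly flat histogram, clipping never helps: every threshold below the full range folds extra mass into the last bin of P, so the sketch keeps all bins. Real activation histograms have long, thin tails, which is where a smaller threshold wins.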

2.4 Computing the Scale

Once target_threshold is known, computing the Scale is simple:

float act_scale = hist_edge[i][threshold_bin] / fake_quant_set;  // fake_quant_set = 127
int act_zero_point = 0;

Again, since this is symmetric quantization, only the Scale needs to be computed; Zero_point is always zero.
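Combining this with the hist_edge values built in section 2.1 (bin centre = step_bin * (j + 0.5), with step_bin = step_max / 2048), the Scale for a given threshold bin can be sketched as follows. The helper name scale_from_threshold_bin is hypothetical; the arithmetic mirrors the two excerpts.

```cpp
#include <cassert>

// Hypothetical helper mirroring the article's formulas:
// edge = (abs_max / 2048) * (threshold_bin + 0.5), scale = edge / 127.
float scale_from_threshold_bin(float abs_max, int threshold_bin) {
    assert(abs_max > 0.0f && threshold_bin >= 0 && threshold_bin < 2048);
    const float step_bin = abs_max / 2048.0f;  // width of one histogram bin
    const float edge = step_bin * (static_cast<float>(threshold_bin) + 0.5f);  // bin centre
    return edge / 127.0f;  // fake_quant_set = 127
}
```

For example, with abs_max = 4096 each bin is 2 wide, so threshold bin 63 has its centre at 127 and the resulting Scale is 1.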

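Then the activation calibration table table_kl.scale can be saved. Once more, the weight & bias quantization that follows is the same as for MIN-MAX, which I covered in the previous article, so I will not repeat it here.

That completes a practical implementation of KL-divergence quantization. I hope this write-up helps a little with your studies.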
