MySQL Case Study: ANALYZE, Slow Queries, and Unresponsive Queries

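Problem description

MySQL occasionally picks an inaccurate execution plan, which makes a query slow. The usual suspicion is stale index statistics, so you run ANALYZE on the table and then retry the SELECT. At that point you may find that the SELECT hangs without responding, and that every other normally fast query on the analyzed table hangs as well.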

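Solution

If this has already happened, try killing the "earliest" slow queries.

That is, if tb1 has slow queries and the problem appeared after running ANALYZE on it, find the slow queries on tb1 that started before the ANALYZE and have not yet finished, then kill all of them.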
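
As a rough sketch of that search, assuming MySQL 5.7 (where information_schema.processlist is available) and a table called tb1, the following lists running statements that mention the table and are not themselves stuck on the flush, oldest first. The LIKE match on the statement text is only a heuristic, so review each row before killing anything.

```sql
-- List candidate statements on tb1, oldest first.
-- 'tb1' is a placeholder for the real table name.
SELECT id,
       time AS seconds_running,
       state,
       info,
       CONCAT('KILL ', id, ';') AS kill_stmt   -- convenience text, not executed
FROM information_schema.processlist
WHERE command = 'Query'
  AND info LIKE '%tb1%'
  AND (state IS NULL OR state <> 'Waiting for table flush')
  AND id <> CONNECTION_ID()
ORDER BY time DESC;

-- After reviewing the list, kill the old statements one by one, for example:
-- KILL 477;   -- 477 is the thread id of the sleep query in this article's processlist
```

Statements that started before the ANALYZE are the ones with the largest running time, which is why the result is sorted by time in descending order.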

Reproducing the problem

First, set up the scenario:

CREATE TABLE `stu` (
  `id` int(11) NOT NULL,
  `name` varchar(16) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_name` (`name`),
  KEY `idx_age` (`age`),
  KEY `idx_n_a` (`name`,`age`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `stu` VALUES (9,'adam',25),(7,'carlos',25),(1,'dave',19),(5,'sam',22),(3,'tom',22),(11,'zoe',29);

Now fake a long-running slow query:

mysql> select sleep(3600) from stu;

Then, in other sessions, simulate the ANALYZE and SELECT operations:

mysql> analyze table stu;
+----------+---------+----------+----------+
| Table    | Op      | Msg_type | Msg_text |
+----------+---------+----------+----------+
| test.stu | analyze | status   | OK       |
+----------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> select * from stu limit 1;

You will find that this LIMIT 1 statement is blocked as well, and that it does not trigger innodb_lock_wait_timeout.
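
innodb_lock_wait_timeout covers only InnoDB row-lock waits, which is why it does not fire here. As far as I can tell, the wait in this case is bounded by lock_wait_timeout instead (the open_table source excerpt in this article passes ot_ctx->get_timeout() into the wait), and its default is one year. A minimal sketch, assuming that behaviour holds on your version, to make the blocked session give up quickly:

```sql
-- Run in the session that is about to issue the blocked SELECT.
SHOW VARIABLES LIKE 'lock_wait_timeout';   -- default is 31536000 seconds

SET SESSION lock_wait_timeout = 5;         -- 5 seconds is an arbitrary test value

-- Expected (assumption) to fail with a lock wait timeout error after about
-- 5 seconds instead of hanging while the sleep query is still running.
SELECT * FROM stu LIMIT 1;
```

Lowering the timeout does not remove the blockage; it only turns an indefinite hang into an error the application can handle.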

Looking at the processlist from another session shows the following wait state:

mysql> show processlist;
+-----+------+-----------------+--------------------+---------+------+-------------------------+-----------------------------+
| Id  | User | Host            | db                 | Command | Time | State                   | Info                        |
+-----+------+-----------------+--------------------+---------+------+-------------------------+-----------------------------+
| 457 | root | 127.0.0.1:48650 | sbtest             | Sleep   | 4860 |                         | NULL                        |
| 458 | root | 127.0.0.1:48652 | sbtest             | Sleep   | 4851 |                         | NULL                        |
| 473 | root | 127.0.0.1:49512 | performance_schema | Sleep   | 4834 |                         | NULL                        |
| 477 | root | 127.0.0.1:52364 | test               | Query   |   26 | User sleep              | select sleep(3600) from stu |
| 478 | root | 127.0.0.1:53124 | test               | Query   |   10 | Waiting for table flush | select * from stu limit 1   |
| 479 | root | 127.0.0.1:53944 | sbtest             | Query   |    0 | starting                | show processlist            |
| 480 | root | 127.0.0.1:53946 | sbtest             | Sleep   |  958 |                         | NULL                        |
+-----+------+-----------------+--------------------+---------+------+-------------------------+-----------------------------+
7 rows in set (0.00 sec)

mysql>
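
To pick out only the blocked statements on a busy server, a filter on the state column can help. This is a small sketch assuming MySQL 5.7, where information_schema.processlist exposes the same columns as SHOW PROCESSLIST:

```sql
-- Statements currently stuck waiting for old table versions to be closed.
SELECT id, user, db, time AS seconds_waiting, info
FROM information_schema.processlist
WHERE state = 'Waiting for table flush'
ORDER BY time DESC;
```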

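Root cause analysis

The problem is now clear, and the wait state points squarely at Waiting for table flush, so start there and look for the cause. First, the explanation in the official documentation: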

Waiting for table flush: The thread is executing FLUSH TABLES and is waiting for all threads to close their tables, or the thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question. This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.

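The reason for this state is spelled out clearly: the table's structure has "changed", so a new thread opening the table must wait for the other threads to close it first.

Next, look at what ANALYZE actually does, again quoting the official documentation: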

ANALYZE TABLE removes the table from the table definition cache, which requires a flush lock. If there are long running statements or transactions still using the table, subsequent statements and transactions must wait for those operations to finish before the flush lock is released. Because ANALYZE TABLE itself typically finishes quickly, it may not be apparent that delayed transactions or statements involving the same table are due to the remaining flush lock. ...... ANALYZE TABLE clears table statistics from the INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS table and sets the STATS_INITIALIZED column to Uninitialized. Statistics are collected again the next time the table is accessed.

From this description, ANALYZE tries to acquire a flush lock, while the re-sampling of statistics is actually triggered by the next SELECT.
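
The statistics state mentioned in the quote can be inspected directly. The following is a sketch assuming MySQL 5.7 and the test.stu table from the reproduction; on MySQL 8.0 the view is, to my knowledge, named INNODB_TABLESTATS instead. Run it with no long-running statement on the table, otherwise the session may block as described in this article.

```sql
-- Reset the statistics for the table.
ANALYZE TABLE stu;

-- Inspect the statistics state; NAME is stored as 'schema/table'.
SELECT name, stats_initialized, num_rows, modified_counter
FROM information_schema.INNODB_SYS_TABLESTATS
WHERE name = 'test/stu';
```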

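So the question becomes: when the blocking occurs, is the statement stuck re-sampling the data, or is it waiting for other threads to close the table?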

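Some background

First, you need to know about two MySQL structures: table_definition (the table definition cache) and table_open_cache. Put simply, when a client wants to open a table, it first tries to take an instance from the open cache; if the table has a "new version", or the cache has no entry for it, it copies the latest data from table_definition.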
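
Both caches can be observed from a client session. The statements below are a small sketch using the test.stu table from the reproduction; the exact counter values depend on the server and workload.

```sql
-- Configured sizes of the two caches.
SHOW VARIABLES LIKE 'table_open_cache';
SHOW VARIABLES LIKE 'table_definition_cache';

-- Current and cumulative counters for both caches.
SHOW GLOBAL STATUS LIKE 'Open_tables';
SHOW GLOBAL STATUS LIKE 'Opened_tables';
SHOW GLOBAL STATUS LIKE 'Open_table_definitions';
SHOW GLOBAL STATUS LIKE 'Opened_table_definitions';

-- Whether stu is currently open in the table cache and how many
-- sessions are using it (the In_use column).
SHOW OPEN TABLES FROM test LIKE 'stu';
```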

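Detailed analysis

In the environment built above, dump the stack traces to see what is happening. With the noise removed, here is the stack for the SELECT statement: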

futex_abstimed_wait_cancelable,
  __pthread_cond_wait_common,
    __pthread_cond_timedwait,
      MDL_wait::timed_wait,
        TABLE_SHARE::wait_for_old_version,
          open_table,
            open_tables,
              open_tables_for_query,
                ::??,
                  mysql_execute_command,
                    mysql_parse,
                      dispatch_command,
                        do_command,
                          handle_connection,
                            pfs_spawn_thread,
                              start_thread,clone

The statement is clearly waiting, and specifically in wait_for_old_version, which looks a little odd. So look at what the open_table function is doing:

open_table()
{
...
  if (!(flags & MYSQL_OPEN_IGNORE_FLUSH))
  {
    if (share->has_old_version()) // an old version of the table share exists
    {
      release_table_share(share);
      mysql_mutex_unlock(&LOCK_open);

      MDL_deadlock_handler mdl_deadlock_handler(ot_ctx);
      bool wait_result;
...

      wait_result= tdc_wait_for_old_version(thd, table_list->db,
                                            table_list->table_name,
                                            ot_ctx->get_timeout(),
                                            deadlock_weight);

      thd->pop_internal_handler();
...
    if (thd->open_tables && thd->open_tables->s->version != share->version)
    // If the versions differ, release all cached instances of this table and reopen
    {
      release_table_share(share);
      mysql_mutex_unlock(&LOCK_open);
      (void)ot_ctx->request_backoff_action(Open_table_context::OT_REOPEN_TABLES,
                                           NULL);
      DBUG_RETURN(TRUE);
    }
}
......

tdc_wait_for_old_version(THD *thd, const char *db, const char *table_name,
                         ulong wait_timeout, uint deadlock_weight)
{
  TABLE_SHARE *share;
  bool res= FALSE;

  mysql_mutex_lock(&LOCK_open);
  if ((share= get_cached_table_share(thd, db, table_name)) &&
      share->has_old_version())  
  // Look up the table share and check its version; the if body runs while the old version still exists
  {
    struct timespec abstime;
    set_timespec(&abstime, wait_timeout);
    res= share->wait_for_old_version(thd, &abstime, deadlock_weight);
  }
  mysql_mutex_unlock(&LOCK_open);
  return res;
}

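As shown, when open_table finds that an old version exists, it calls tdc_wait_for_old_version; as long as the table's old version persists, it keeps waiting until the timeout passed in expires. So the SELECT statement is in fact waiting the whole time for the cached old-version instances of the table to be released.

This version is what MySQL uses to mark the version of table_definition. When it is updated, the table's structure has "changed", and every cached instance of the table is invalid and can no longer be used. In MySQL this variable is refresh_version.

From this we can infer that ANALYZE TABLE increments refresh_version. The code comments state that it currently changes only on flush_table, but ANALYZE was the only operation run in the test environment. Given that ANALYZE tries to acquire a flush lock, its implementation probably reuses the flush mechanism.

PS: if the statement run afterwards is another ANALYZE on the same table rather than a SELECT, it is blocked too.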

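Going further

Given the nature of this old_version problem, we can extend the list of scenarios likely to hit it:

ANALYZE definitely hits it, as this case shows.
FLUSH TABLES may hit it, because it also increments refresh_version.
FLUSH TABLES WITH READ LOCK hits it too, because it is also a flush operation.
There may be other scenarios that involve changes to table_definition, such as DDL?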
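
The FLUSH TABLES case can be checked with the same setup as the ANALYZE reproduction. The following is a sketch, assuming the stu table from earlier and four separate sessions. Based on the documentation quoted in this article, the FLUSH statement itself is expected to wait, not just the statements issued after it.

```sql
-- Session 1: hold the table open with a long-running statement.
SELECT SLEEP(3600) FROM stu;

-- Session 2: flush only this table. Unlike ANALYZE TABLE, this statement
-- is expected to block until session 1 closes the table.
FLUSH TABLES stu;

-- Session 3: a new reader is expected to queue up behind the flush.
SELECT * FROM stu LIMIT 1;

-- Session 4: check the wait state of sessions 2 and 3.
SHOW PROCESSLIST;
```

Killing the statement in session 1 should release the sessions waiting behind it, which mirrors the fix described in the Solution section.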
