記一次Lock wait timeout異常的排查過程
阿新 • • 發佈:2021-07-13
#########################
業務反映某一天這個Lock wait timeout異常特別嚴重,於是找到我來排查,這不很明顯的嗎?就是等待鎖超時異常了,但是呢,業務不知道哪個地方出現了鎖等待,也不清楚什麼sql導致的,拉一堆人馬在那裡等待我救命
於是,我迅雷不及掩耳之勢查找了mysql錯誤日誌和中介軟體日誌,業務也說了什麼時候出現的,因此直接定位到對應時間段周圍:
percona版本的mysql服務的錯誤日誌如下:
2021-07-13T14:21:27.442042+08:00 3262227 [Note] InnoDB: Lock wait timeout info: Requested thread id:3262227 Requested trx id: 48830402 Requested query: update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790 Total blocking transactions count: 1 Blocking transaction number: 1 Blocking thread id: 3232905 Blocking query id: 1092250130 Blocking trx id: 48830301
根據錯誤日誌得出:
1)時間是:2021-07-13T14:21:27.442042+08:00,
2)問題是:mysql的3262227會話執行緒被3232905會話執行緒給阻塞了
3)3262227會話執行緒上的什麼sql被阻塞了?3262227會話執行緒上的“update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790”這個sql語句被阻塞住了,
但是呢,這裡只知道3232905會話執行緒的id,其餘資訊對解決問題沒啥幫助,於是排查kingshard上的業務sql,中介軟體日誌記錄了業務發起sql的時間,執行時間,來源ip,目的ip,會話執行緒id,SQL語句,
因此根據3232905會話執行緒這個id去找到這個會話在該時間段周圍的所有sql,於是如下:
資料庫中介軟體kingshard的日誌如下:
2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465,conn_id=7479433|COMMIT 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select current_timestamp 2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|insert into apple (ext_id, int_id, txnid, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, mac_last_heartbeat, mac_user, mac_host) values (105790, 1,0, 'dw_business', 'dwd_ord_ord_df', null, 'w', 'e', 1626157265000, 'work', 'zjy-hadoop-prc-st2203.bj') 2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where ext_id = 105790 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|SAVEPOINT `__66a6cb85_17a856cffba__3b3e` 2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where mac_db in ('dw_business') and (mac_table is null or mac_table in('dwd_ord_ord_df')) and ext_id <= 105790 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|COMMIT 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465->0.0.0.0:6853,conn_id=7479433|SET autocommit=1 2021/07/13 14:21:07 - Connected - conn_id=7543980, autopilot@127.0.0.1:15509 () capability: 238213 2021/07/13 14:21:07 - Quit - conn_id=7543980, 127.0.0.1:15509 2021/07/13 14:21:11 - Connected - conn_id=7543981, autopilot@127.0.0.1:15667 () capability: 238213 2021/07/13 14:21:11 - Quit - conn_id=7543981, 127.0.0.1:15667 2021/07/13 14:21:14 - OK - 0.0ms - 10.10.10.20:43325->0.0.0.0:6853,conn_id=7309803|SET autocommit=0
### 上面日誌表示:
# 3232905這個會話執行緒先執行了select current_timestamp語句,
# 然後執行了insert into apple語句,
# 然後又鬼使神差地執行了SAVEPOINT `__66a6cb85_17a856cffba__3b3e`語句,
# 最後執行了一個select語句
# 然後就沒了然後
................
####下面這句是3262227會話執行緒的執行結果:ERROR,耗時10s多,來源ip是10.10.10.20,這個是業務方的服務所在機器的ip,目的ip為10.10.10.10,這個是mysql叢集的主庫ip地址,sql語句如下所示:
2021/07/13 14:21:27 - ERROR - 10981.2ms - 10.10.10.20:45954->10.10.10.10:3306,mysql_connect_id=3262227|update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790
########################
[email protected]