1. 程式人生 > 其它 >記一次Lock wait timeout異常的排查過程

記一次Lock wait timeout異常的排查過程

#########################

業務反映某一天這個Lock wait timeout異常特別嚴重,於是找到我來排查,這不很明顯的嗎?就是等待鎖超時異常了,但是呢,業務不知道哪個地方出現了鎖等待,也不清楚什麼sql導致的,拉一堆人馬在那裡等待我救命

於是,我迅雷不及掩耳之勢查找了mysql錯誤日誌和中介軟體日誌,業務也說了什麼時候出現的,因此直接定位到對應時間段周圍:

percona版本的mysql服務的錯誤日誌如下:

2021-07-13T14:21:27.442042+08:00 3262227 [Note] InnoDB: Lock wait timeout info:
Requested thread id: 
3262227 Requested trx id: 48830402 Requested query: update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790 Total blocking transactions count: 1 Blocking transaction number: 1 Blocking thread id: 3232905 Blocking query id: 1092250130 Blocking trx id: 48830301

根據錯誤日誌得出:

1)時間是:2021-07-13T14:21:27.442042+08:00,

2)問題是:mysql的3262227會話執行緒被3232905會話執行緒給阻塞了

3)3262227會話執行緒上的什麼sql被阻塞了?3262227會話執行緒上的“update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790”這個sql語句被阻塞住了,

但是呢,這裡只知道3232905會話執行緒的id,其餘資訊對解決問題沒啥幫助,於是排查kingshard上的業務sql,中介軟體日誌記錄了業務發起sql的時間,執行時間,來源ip,目的ip,會話執行緒id,SQL語句,

因此根據3232905會話執行緒這個id去找到這個會話在該時間段周圍的所有sql,於是如下:

資料庫中介軟體kingshard的日誌如下:

2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465,conn_id=7479433|COMMIT
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select current_timestamp
2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|insert into apple  (ext_id, 
                                   int_id, txnid, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, mac_last_heartbeat, mac_user, mac_host) 
                                   values (105790, 1,0, 'dw_business', 'dwd_ord_ord_df', null, 'w', 'e', 1626157265000, 'work', 'zjy-hadoop-prc-st2203.bj')
2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, 
                                   int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where ext_id = 105790
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|SAVEPOINT `__66a6cb85_17a856cffba__3b3e`
2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id,
                                   int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where mac_db in
                                   ('dw_business') and (mac_table is null or mac_table in('dwd_ord_ord_df')) and ext_id <= 105790
2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|COMMIT
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK
2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465->0.0.0.0:6853,conn_id=7479433|SET autocommit=1
2021/07/13 14:21:07 - Connected - conn_id=7543980, autopilot@127.0.0.1:15509 () capability: 238213
2021/07/13 14:21:07 - Quit - conn_id=7543980, 127.0.0.1:15509
2021/07/13 14:21:11 - Connected - conn_id=7543981, autopilot@127.0.0.1:15667 () capability: 238213
2021/07/13 14:21:11 - Quit - conn_id=7543981, 127.0.0.1:15667
2021/07/13 14:21:14 - OK - 0.0ms - 10.10.10.20:43325->0.0.0.0:6853,conn_id=7309803|SET autocommit=0

### 上面日誌表示:

# 3232905這個會話執行緒先執行了select current_timestamp語句,

# 然後執行了insert into apple語句,

# 然後又鬼使神差地執行了SAVEPOINT `__66a6cb85_17a856cffba__3b3e`語句,

# 最後執行了一個select語句

# 然後就沒了然後

................

####下面這句是3262227會話執行緒的執行結果:ERROR,耗時10s多,來源ip是10.10.10.20,這個是業務方的服務所在機器的ip,目的ip為10.10.10.10,這個是mysql叢集的主庫ip地址,sql語句如下所示:

2021/07/13 14:21:27 - ERROR - 10981.2ms - 10.10.10.20:45954->10.10.10.10:3306,mysql_connect_id=3262227|update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790

########################

[email protected]