
Slow filtered queries on a large table

1. Problem description

The query in question looks like this:

select * from table_name where xxx='yyy' limit 10;

The Hive table is stored in ORC format, the execution engine is Tez, and parallelism has already been tuned up to several dozen tasks. Yet when this SQL runs, it simply hangs and never completes.

2. Symptoms and analysis

Symptom: the query hangs.
Analysis:
1. Check hiveserver2.log to see how the SQL is being executed. The thread handling this SQL is continuously reading data files:

Judging from this, the SQL statement is scanning the data by itself: no MR/Tez job is launched at all, and the files are read locally by HiveServer2, which amounts to a full table scan. No wonder it is slow.

2. Inspect the thread's execution with jstack
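A minimal sketch of this step, assuming a standard JDK (`jps`/`jstack` on the PATH) and that the ORC reader class name shown in the grep pattern matches your Hive version:

```shell
# Locate the HiveServer2 JVM (jps lists running Java processes).
HS2_PID=$(jps -l 2>/dev/null | awk '/HiveServer2/ {print $1}' | head -n 1)

if [ -n "$HS2_PID" ]; then
    # Dump all thread stacks; a stuck handler thread shows ORC reader
    # frames instead of frames that submit a Tez DAG.
    jstack "$HS2_PID" | grep -B 2 -A 20 "org.apache.hadoop.hive.ql.io.orc"
else
    echo "HiveServer2 process not found"
fi
```

Seeing the handler thread deep inside ORC file-reading frames confirms that the scan is happening inside HiveServer2 itself rather than in a YARN container.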

3. Analyze the query plan with explain
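For instance, using the query from above (the exact plan text varies by Hive version):

```sql
explain select * from table_name where xxx='yyy' limit 10;
-- When fetch-task conversion kicks in, the plan typically contains only a
-- single Fetch Operator stage and no Tez/MapReduce stage, i.e. HiveServer2
-- will read the files itself instead of launching a job.
```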

3. Solution

The problem is solved by adjusting the following parameter:

hive.fetch.task.conversion

Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced, not having any subquery, and should not have any aggregations or distincts (which incur RS, i.e. ReduceSinkOperator, requiring a MapReduce task), lateral views, or joins.

Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

This setting tries to convert a query into a single fetch task.

The default is more. After changing it to none and re-running the SQL above, the query is submitted to YARN for execution.
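For example, at the session level (table and column names taken from the query above):

```sql
-- Disable fetch-task conversion for this session so the query is planned
-- as a normal Tez job and submitted to YARN.
set hive.fetch.task.conversion=none;
select * from table_name where xxx='yyy' limit 10;
```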

hive.fetch.task.conversion.threshold

Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
The default is 1073741824 (1 GB).
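If disabling the conversion outright is too drastic, a gentler option (a sketch; the 128 MB value is only an example) is to lower this threshold so that only small inputs are fetched directly:

```sql
-- Keep fetch conversion enabled, but only for inputs up to ~128 MB;
-- larger tables fall back to a regular Tez job on YARN.
set hive.fetch.task.conversion.threshold=134217728;
```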

Reference:
https://www.cnblogs.com/barneywill/p/10109217.html