Hive: Running Simple Queries as a Fetch Task Instead of a MapReduce Job

I. Background

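Even if you query only a single column of a table, Hive by default launches a MapReduce job to do the work, and launching a MapReduce job carries real overhead. Starting with Hive 0.10.0, simple queries (no functions, no sorting, no aggregation), such as SELECT <col> FROM <table> LIMIT n, can avoid this. With the Fetch Task feature enabled, such a query does not generate a MapReduce job; Hive instead uses a Fetch Task that reads the data directly from HDFS and outputs it, which cuts latency.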

II. How to Configure Fetch Task

1. At the hive prompt

hive> set hive.fetch.task.conversion=more;

2. Pass the parameter when starting hive

bin/hive --hiveconf hive.fetch.task.conversion=more

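3. Edit hive-site.xml, add the property, then save and exit.
Both of the methods above enable Fetch Task, but only temporarily, for the current session. To keep the feature enabled permanently, add the following configuration to ${HIVE_HOME}/conf/hive-site.xml: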

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to single FETCH task
    minimizing latency.Currently the query should be single
    sourced not having any subquery and should not have
    any aggregations or distincts (which incurrs RS),
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

III. Examples

1. Without Fetch Task configured, Hive launches a MapReduce job by default to run the query.

hive> select id from t ;                 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there is no reduce operator
Starting Job = job_1402248601715_0004, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1402248601715_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 11:12:54,817 Stage-1 map = 0%,  reduce = 0%
2014-06-09 11:13:15,790 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
2014-06-09 11:13:16,982 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
MapReduce Total cumulative CPU time: 2 seconds 960 msec
Ended Job = job_1402248601715_0004
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.96 sec   HDFS Read: 257 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 960 msec
OK
Time taken: 51.496 seconds

The log above shows that this query launched a MapReduce job with 1 mapper and no reducers.
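If you capture the CLI output of a query, a quick way to tell the two execution paths apart is to look for the job-launch lines seen in the log above. The sketch below is only an illustration: the helper name `launched_mapreduce` is hypothetical, and the marker strings are taken from this article's log, so other Hive versions may word them differently.

```python
def launched_mapreduce(cli_output: str) -> bool:
    """Return True if the captured output contains a MapReduce launch marker."""
    # Marker strings copied from the log in this article (an assumption for
    # other Hive versions, which may phrase these lines differently).
    markers = ("Total MapReduce jobs", "Launching Job")
    return any(marker in cli_output for marker in markers)


# Abridged sample outputs based on the logs in this article.
mapreduce_output = """Total MapReduce jobs = 1
Launching Job 1 out of 1
OK
Time taken: 51.496 seconds"""

fetch_output = """OK
Time taken: 0.242 seconds"""

print(launched_mapreduce(mapreduce_output))  # True
print(launched_mapreduce(fetch_output))      # False
```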

2. Configure Fetch Task with the hive.fetch.task.conversion parameter:

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task 
    minimizing latency.Currently the query should be single 
    sourced not having any subquery and should not have
    any aggregations or distincts (which incurrs RS), 
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

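The parameter defaults to minimal, which means that "select *" queries with a limit are converted to a FetchTask. With the value more, queries that select specific columns with a limit are converted to a FetchTask as well. Some preconditions still apply: the query must have a single data source, that is, its input comes from one table or partition; it must have no subquery; it must have no aggregation or distinct; and it cannot use lateral views or joins.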
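The rule described above can be sketched as a small predicate. This is a simplified model of the rule as stated in the property description, not Hive's actual implementation; the `Query` class, its field names, and the function `may_convert_to_fetch` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical summary of a query's shape; not a Hive data structure."""
    select_star: bool = False            # SELECT * rather than named columns
    filters_partition_only: bool = True  # WHERE (if any) uses only partition columns
    single_source: bool = True           # input is one table or partition
    has_subquery: bool = False
    has_aggregation_or_distinct: bool = False
    has_lateral_view: bool = False
    has_join: bool = False


def may_convert_to_fetch(query: Query, conversion: str = "minimal") -> bool:
    """Return True if the query fits the stated rule for the given mode."""
    # Preconditions that apply in both modes.
    if not query.single_source:
        return False
    if (query.has_subquery or query.has_aggregation_or_distinct
            or query.has_lateral_view or query.has_join):
        return False
    if conversion == "more":
        # SELECT, FILTER and LIMIT are all allowed.
        return True
    if conversion == "minimal":
        # Only SELECT * with filters restricted to partition columns.
        return query.select_star and query.filters_partition_only
    # Only the two modes named in the description are modeled here.
    raise ValueError(f"unknown conversion mode: {conversion!r}")


# "select id from t" selects a named column rather than *.
column_query = Query(select_star=False)
print(may_convert_to_fetch(column_query, "minimal"))  # False
print(may_convert_to_fetch(column_query, "more"))     # True
```

Under this model, a single-column select only qualifies once the mode is switched from minimal to more.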

To test this, first set the parameter to more, then run the queries:

hive> set hive.fetch.task.conversion=more;
hive> select id from t limit 1;           
OK
Time taken: 0.242 seconds
hive> select id from t ;                  
OK
Time taken: 0.496 seconds

Note that table t is an empty table with no data.