[Original] Big Data Basics: Hive (1) How Hive SQL Is Executed

阿新 • Published: 2018-12-27

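Hive 2.1

Hive can run SQL in two ways:

Run the hive command, which has three forms: hive -e, hive -f, and the interactive hive shell;
Run the beeline command; beeline connects to a remote thrift server.

The sections below look at how SQL gets executed in each case: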

1 The hive command

Start command

Command to start the hive client

$HIVE_HOME/bin/hive

which is equivalent to

$HIVE_HOME/bin/hive --service cli

which invokes

$HIVE_HOME/bin/ext/cli.sh

The class that actually starts is org.apache.hadoop.hive.cli.CliDriver

Code walkthrough

org.apache.hadoop.hive.cli.CliDriver

  public static void main(String[] args) throws Exception {
    int ret = new CliDriver().run(args);
    System.exit(ret);
  }

  public  int run(String[] args) throws Exception {
...
    // execute cli driver work
    try {
      return executeDriver(ss, conf, oproc);
    } finally {
      ss.resetThreadName();
      ss.close();
    }
...

  private int executeDriver(CliSessionState ss, HiveConf conf, OptionsProcessor oproc)
      throws Exception {
...
    if (ss.execString != null) {
      int cmdProcessStatus = cli.processLine(ss.execString);
      return cmdProcessStatus;
    }
...
    try {
      if (ss.fileName != null) {
        return cli.processFile(ss.fileName);
      }
    } catch (FileNotFoundException e) {
      System.err.println("Could not open input file for reading. (" + e.getMessage() + ")");
      return 3;
    }
...
    while ((line = reader.readLine(curPrompt + "> ")) != null) {
      if (!prefix.equals("")) {
        prefix += '\n';
      }
      if (line.trim().startsWith("--")) {
        continue;
      }
      if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) {
        line = prefix + line;
        ret = cli.processLine(line, true);
...

  public int processFile(String fileName) throws IOException {
...
      rc = processReader(bufferReader);
...

  public int processReader(BufferedReader r) throws IOException {
    String line;
    StringBuilder qsb = new StringBuilder();

    while ((line = r.readLine()) != null) {
      // Skipping through comments
      if (! line.startsWith("--")) {
        qsb.append(line + "\n");
      }
    }

    return (processLine(qsb.toString()));
  }
  
  public int processLine(String line, boolean allowInterrupting) {
...
        ret = processCmd(command);
...

  public int processCmd(String cmd) {
...
        CommandProcessor proc = CommandProcessorFactory.get(tokens, (HiveConf) conf);
        ret = processLocalCmd(cmd, proc, ss);
...

  int processLocalCmd(String cmd, CommandProcessor proc, CliSessionState ss) {
    int tryCount = 0;
    boolean needRetry;
    int ret = 0;

    do {
      try {
        needRetry = false;
        if (proc != null) {
          if (proc instanceof Driver) {
            Driver qp = (Driver) proc;
            PrintStream out = ss.out;
            long start = System.currentTimeMillis();
            if (ss.getIsVerbose()) {
              out.println(cmd);
            }

            qp.setTryCount(tryCount);
            ret = qp.run(cmd).getResponseCode();
...
              while (qp.getResults(res)) {
                for (String r : res) {
                  out.println(r);
                }
...

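CliDriver.main calls run, run calls executeDriver, and executeDriver handles the three cases mentioned above:

hive -e runs a SQL string: ss.execString is non-null, and the process exits when execution finishes;
hive -f runs a SQL file: ss.fileName is non-null, and the process exits when execution finishes;
interactive hive: the loop keeps reading lines with reader.readLine, executes each completed statement, and prints the results;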
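
The interactive case is the least obvious of the three, so here is a minimal sketch of how lines are grouped into statements. It is a toy model, not Hive code: StatementAccumulator and accumulate are hypothetical names, and the else branch (appending an unfinished line to the prefix) is elided in the excerpt above, so it is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the interactive loop in CliDriver.executeDriver; not Hive code.
final class StatementAccumulator {
    // Groups input lines into statements the way the excerpt does:
    // comment lines are skipped, and a statement ends at a line whose
    // trimmed form ends with ";" but not with an escaped "\;".
    static List<String> accumulate(List<String> lines) {
        List<String> statements = new ArrayList<>();
        String prefix = "";
        for (String line : lines) {
            if (!prefix.isEmpty()) {
                prefix += '\n';
            }
            String trimmed = line.trim();
            if (trimmed.startsWith("--")) {
                continue;
            }
            if (trimmed.endsWith(";") && !trimmed.endsWith("\\;")) {
                // In Hive this is where cli.processLine(line, true) is called.
                statements.add(prefix + line);
                prefix = "";
            } else {
                // Assumption: unfinished lines are carried over to the next read.
                prefix += line;
            }
        }
        return statements;
    }
}
```

Feeding it the lines "select *" and "from t;" yields one statement spanning both lines, which is why a query can be typed across several prompts before it runs.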

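All three cases end up in processLine. processLine calls processCmd, which calls processLocalCmd; processLocalCmd first calls Driver.run to execute the SQL and then calls Driver.getResults to print the results. These are also Driver's two most important methods; the Driver implementation is covered later;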
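
The run-then-fetch shape of processLocalCmd can be sketched with a stand-in interface. Everything here is hypothetical (QueryRunner, fakeRunner, execute); it only assumes what the excerpt shows: a run call that returns a response code, and a getResults call that fills a list and returns false once nothing is left.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the run/getResults contract; not the Hive Driver.
final class RunThenFetchSketch {
    interface QueryRunner {
        int run(String command);               // 0 means success in this sketch
        boolean getResults(List<String> out);  // fills one batch; false when exhausted
    }

    // Fake runner that serves canned rows in batches of batchSize.
    static QueryRunner fakeRunner(List<String> rows, int batchSize) {
        Iterator<String> it = rows.iterator();
        return new QueryRunner() {
            public int run(String command) {
                return 0;
            }
            public boolean getResults(List<String> out) {
                if (!it.hasNext()) {
                    return false;
                }
                for (int i = 0; i < batchSize && it.hasNext(); i++) {
                    out.add(it.next());
                }
                return true;
            }
        };
    }

    // Mirrors the shape of processLocalCmd: run first, then drain the results.
    static List<String> execute(QueryRunner runner, String command) {
        List<String> printed = new ArrayList<>();
        if (runner.run(command) != 0) {
            return printed;
        }
        List<String> batch = new ArrayList<>();
        while (runner.getResults(batch)) {
            printed.addAll(batch);  // the CLI prints each row here
            batch.clear();
        }
        return printed;
    }
}
```

The caller never sees the whole result set at once; it keeps asking for the next batch until getResults reports that nothing remains.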

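2 The beeline command

beeline needs to connect to the hive thrift server, so first look at how the hive thrift server starts:

hive thrift server

Start command

Command to start the hive thrift server

$HIVE_HOME/bin/hiveserver2

which is equivalent to

$HIVE_HOME/bin/hive --service hiveserver2

which invokes

$HIVE_HOME/bin/ext/hiveserver2.sh

The class that actually starts is org.apache.hive.service.server.HiveServer2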

Startup sequence

HiveServer2.main

         startHiveServer2

                  init

                          addService-CLIService,ThriftBinaryCLIService

                  start

                          Service.start

                                   CLIService.start

                                   ThriftBinaryCLIService.start

                                            TThreadPoolServer.serve
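
The call tree above follows a composite-service pattern: the server registers child services during init, then starts each of them in order. A minimal sketch of that ordering, with hypothetical classes (Service, Composite, Server) that only reuse the two service names from the tree as labels:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a composite service lifecycle; not Hive's classes.
final class CompositeServiceSketch {
    interface Service {
        void init();
        void start();
    }

    static class Composite implements Service {
        private final List<Service> children = new ArrayList<>();
        void addService(Service s) { children.add(s); }
        public void init()  { for (Service s : children) s.init(); }
        public void start() { for (Service s : children) s.start(); }
    }

    // Leaf service that only records which lifecycle method was called.
    static Service named(String name, List<String> log) {
        return new Service() {
            public void init()  { log.add(name + ".init"); }
            public void start() { log.add(name + ".start"); }
        };
    }

    // Registers its children inside init, as in the call tree above.
    static final class Server extends Composite {
        private final List<String> log;
        Server(List<String> log) { this.log = log; }
        @Override public void init() {
            addService(named("CLIService", log));
            addService(named("ThriftBinaryCLIService", log));
            super.init();
        }
    }

    // Boots the server and returns the order in which lifecycle calls happened.
    static List<String> boot() {
        List<String> log = new ArrayList<>();
        Server server = new Server(log);
        server.init();
        server.start();
        return log;
    }
}
```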

Class structure: [interface or parent class -> subclass]

TServer->TThreadPoolServer

         TProcessorFactory->SQLPlainProcessorFactory

                  TProcessor->TSetIpAddressProcessor

                          ThriftCLIService->ThriftBinaryCLIService

                                   CLIService

                                            HiveSession

Code walkthrough

org.apache.hive.service.cli.thrift.ThriftBinaryCLIService

  public ThriftBinaryCLIService(CLIService cliService, Runnable oomHook) {
    super(cliService, ThriftBinaryCLIService.class.getSimpleName());
    this.oomHook = oomHook;
  }

ThriftBinaryCLIService is a core class: it actually starts the thrift server and wraps a CLIService, and every request ends up being handled by the underlying CLIService. Next, the CLIService code:

org.apache.hive.service.cli.CLIService

  @Override
  public OperationHandle executeStatement(SessionHandle sessionHandle, String statement,
      Map<String, String> confOverlay) throws HiveSQLException {
    OperationHandle opHandle =
        sessionManager.getSession(sessionHandle).executeStatement(statement, confOverlay);
    LOG.debug(sessionHandle + ": executeStatement()");
    return opHandle;
  }
  
  @Override
  public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
                             long maxRows, FetchType fetchType) throws HiveSQLException {
    RowSet rowSet = sessionManager.getOperationManager().getOperation(opHandle)
        .getParentSession().fetchResults(opHandle, orientation, maxRows, fetchType);
    LOG.debug(opHandle + ": fetchResults()");
    return rowSet;
  }

The two most important methods of CLIService are executeStatement and fetchResults; both are forwarded to HiveSession. Next, the code of the HiveSession implementation class:

org.apache.hive.service.cli.session.HiveSessionImpl

  @Override
  public OperationHandle executeStatement(String statement, Map<String, String> confOverlay) throws HiveSQLException {
    return executeStatementInternal(statement, confOverlay, false, 0);
  }

  private OperationHandle executeStatementInternal(String statement,
      Map<String, String> confOverlay, boolean runAsync, long queryTimeout) throws HiveSQLException {
    acquire(true, true);

    ExecuteStatementOperation operation = null;
    OperationHandle opHandle = null;
    try {
      operation = getOperationManager().newExecuteStatementOperation(getSession(), statement,
          confOverlay, runAsync, queryTimeout);
      opHandle = operation.getHandle();
      operation.run();
...
  @Override
  public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
      long maxRows, FetchType fetchType) throws HiveSQLException {
    acquire(true, false);
    try {
      if (fetchType == FetchType.QUERY_OUTPUT) {
        return operationManager.getOperationNextRowSet(opHandle, orientation, maxRows);
      }
      return operationManager.getOperationLogRowSet(opHandle, orientation, maxRows, sessionConf);
    } finally {
      release(true, false);
    }
  }

As shown above:

HiveSessionImpl.executeStatement calls ExecuteStatementOperation.run (ExecuteStatementOperation is one kind of Operation)
HiveSessionImpl.fetchResults calls OperationManager.getOperationNextRowSet, which in turn calls Operation.getNextRowSet
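
The handle-based forwarding can be sketched as a toy model. All classes below are hypothetical and much simpler than the real ones (integer handles, one session, canned rows); the sketch only shows that the front service owns no logic and that a handle returned by executeStatement is what fetchResults later uses to find the operation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical handle-based delegation; a toy model, not the HiveServer2 classes.
final class HandleDelegationSketch {
    static final class Operation {
        private final String statement;
        private final List<String> rows = new ArrayList<>();
        Operation(String statement) { this.statement = statement; }
        void run() { rows.add("result of: " + statement); }
        List<String> getNextRowSet() { return new ArrayList<>(rows); }
    }

    // Keeps every operation reachable by its handle.
    static final class OperationManager {
        private final Map<Integer, Operation> operations = new HashMap<>();
        private int nextHandle = 0;
        int register(Operation op) {
            operations.put(nextHandle, op);
            return nextHandle++;
        }
        Operation get(int handle) { return operations.get(handle); }
    }

    static final class Session {
        private final OperationManager manager;
        Session(OperationManager manager) { this.manager = manager; }
        int executeStatement(String statement) {
            Operation op = new Operation(statement);
            int handle = manager.register(op);
            op.run();
            return handle;
        }
        List<String> fetchResults(int handle) {
            return manager.get(handle).getNextRowSet();
        }
    }

    // Front service: forwards every call to the session.
    static final class Service {
        private final Session session = new Session(new OperationManager());
        int executeStatement(String statement) { return session.executeStatement(statement); }
        List<String> fetchResults(int handle) { return session.fetchResults(handle); }
    }
}
```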

org.apache.hive.service.cli.operation.OperationManager

  public RowSet getOperationNextRowSet(OperationHandle opHandle,
      FetchOrientation orientation, long maxRows)
          throws HiveSQLException {
    return getOperation(opHandle).getNextRowSet(orientation, maxRows);
  }

Next, look in detail at Operation's run and getNextRowSet:

org.apache.hive.service.cli.operation.Operation

  public void run() throws HiveSQLException {
    beforeRun();
    try {
      Metrics metrics = MetricsFactory.getInstance();
      if (metrics != null) {
        try {
          metrics.incrementCounter(MetricsConstant.OPEN_OPERATIONS);
        } catch (Exception e) {
          LOG.warn("Error Reporting open operation to Metrics system", e);
        }
      }
      runInternal();
    } finally {
      afterRun();
    }
  }
  
  public RowSet getNextRowSet() throws HiveSQLException {
    return getNextRowSet(FetchOrientation.FETCH_NEXT, DEFAULT_FETCH_MAX_ROWS);
  }

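Operation is an abstract class:

run calls the abstract method runInternal
the no-argument getNextRowSet calls the abstract two-argument getNextRowSet(orientation, maxRows)

The implementations of these two abstract methods in subclasses are shown next; in the end they depend on Driver's run and getResults;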
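
This is the template method pattern: the base class fixes the order of the steps and the subclass supplies only the middle one. A minimal sketch with hypothetical classes (the trace list and both subclasses are illustrative, not part of Hive):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical template-method sketch; class names are illustrative, not Hive's.
final class TemplateMethodSketch {
    abstract static class Operation {
        final List<String> trace = new ArrayList<>();

        // Fixed skeleton: subclasses cannot change the order of the steps.
        final void run() {
            trace.add("beforeRun");
            try {
                runInternal();
            } finally {
                trace.add("afterRun");  // runs even when runInternal throws
            }
        }

        // The only step a subclass supplies.
        protected abstract void runInternal();
    }

    static final class EchoOperation extends Operation {
        @Override protected void runInternal() { trace.add("runInternal"); }
    }

    static final class FailingOperation extends Operation {
        @Override protected void runInternal() { throw new IllegalStateException("boom"); }
    }
}
```

Because afterRun sits in a finally block, cleanup happens whether or not the subclass step succeeds, which matches the try/finally in Operation.run above.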

1) First, runInternal as implemented in the subclass HiveCommandOperation:

org.apache.hive.service.cli.operation.HiveCommandOperation

  @Override
  public void runInternal() throws HiveSQLException {
    setState(OperationState.RUNNING);
    try {
      String command = getStatement().trim();
      String[] tokens = statement.split("\\s");
      String commandArgs = command.substring(tokens[0].length()).trim();

      CommandProcessorResponse response = commandProcessor.run(commandArgs);
...

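This calls CommandProcessor.run. Driver is one implementation of CommandProcessor, but in Hive 2.1 HiveCommandOperation is created for non-SQL commands such as set and dfs; SQL statements go to SQLOperation, whose runInternal ends up calling Driver.run;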
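
The string handling in runInternal can be tried in isolation. The helper below is hypothetical (CommandSplitSketch is not a Hive class) and deviates from the excerpt in one place: it splits the trimmed command rather than the raw statement, so leading whitespace cannot produce an empty first token.

```java
// Hypothetical helper that mirrors the string handling in the excerpt; not Hive code.
final class CommandSplitSketch {
    // Returns {firstToken, remainingArgs} for a command line such as "dfs -ls /tmp".
    static String[] split(String statement) {
        String command = statement.trim();
        // Deviation from the excerpt: split the trimmed command, not the raw statement.
        String[] tokens = command.split("\\s");
        // Everything after the first token is passed to the processor as its arguments.
        String commandArgs = command.substring(tokens[0].length()).trim();
        return new String[] { tokens[0], commandArgs };
    }
}
```

The first token selects the processor, and only the remainder is handed to it as arguments.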

2) Next, getNextRowSet as implemented in the subclass SQLOperation:

org.apache.hive.service.cli.operation.SQLOperation

  public RowSet getNextRowSet(FetchOrientation orientation, long maxRows)
    throws HiveSQLException {
...
      driver.setMaxRows((int) maxRows);
      if (driver.getResults(convey)) {
        return decode(convey, rowSet);
      }
...

This calls Driver.getResults;

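3 Driver

The code analysis above shows that whether SQL is run from the hive command line or through beeline connected to the thrift server, execution ultimately depends on Driver.

Driver's two core methods:

run
getResults

Code walkthrough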

org.apache.hadoop.hive.ql.Driver

  @Override
  public CommandProcessorResponse run(String command)
      throws CommandNeedRetryException {
    return run(command, false);
  }
  
  public CommandProcessorResponse run(String command, boolean alreadyCompiled)
        throws CommandNeedRetryException {
    CommandProcessorResponse cpr = runInternal(command, alreadyCompiled);
...
  private CommandProcessorResponse runInternal(String command, boolean alreadyCompiled)
      throws CommandNeedRetryException {
...
        ret = compileInternal(command, true);
...
      ret = execute(true);
...
  private int compileInternal(String command, boolean deferClose) {
...
      ret = compile(command, true, deferClose);
...
  public int compile(String command, boolean resetTaskIds, boolean deferClose) {
...
      plan = new QueryPlan(queryStr, sem, perfLogger.getStartTime(PerfLogger.DRIVER_RUN), queryId,
        queryState.getHiveOperation(), schema);
...
  public int execute(boolean deferClose) throws CommandNeedRetryException {
...
      // Add root Tasks to runnable
      for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
        // This should never happen, if it does, it's a bug with the potential to produce
        // incorrect results.
        assert tsk.getParentTasks() == null || tsk.getParentTasks().isEmpty();
        driverCxt.addToRunnable(tsk);
      }
...
      // Loop while you either have tasks running, or tasks queued up
      while (driverCxt.isRunning()) {

        // Launch upto maxthreads tasks
        Task<? extends Serializable> task;
        while ((task = driverCxt.getRunnable(maxthreads)) != null) {
          TaskRunner runner = launchTask(task, queryId, noName, jobname, jobs, driverCxt);
          if (!runner.isRunning()) {
            break;
          }
        }

        // poll the Tasks to see which one completed
        TaskRunner tskRun = driverCxt.pollFinished();
        if (tskRun == null) {
          continue;
        }
        hookContext.addCompleteTask(tskRun);
        queryDisplay.setTaskResult(tskRun.getTask().getId(), tskRun.getTaskResult());

        Task<? extends Serializable> tsk = tskRun.getTask();
        TaskResult result = tskRun.getTaskResult();
...
        if (tsk.getChildTasks() != null) {
          for (Task<? extends Serializable> child : tsk.getChildTasks()) {
            if (DriverContext.isLaunchable(child)) {
              driverCxt.addToRunnable(child);
            }
          }
        }
      }

  public boolean getResults(List res) throws IOException, CommandNeedRetryException {
    if (driverState == DriverState.DESTROYED || driverState == DriverState.CLOSED) {
      throw new IOException("FAILED: query has been cancelled, closed, or destroyed.");
    }

    if (isFetchingTable()) {
      /**
       * If resultset serialization to thrift object is enabled, and if the destination table is
       * indeed written using ThriftJDBCBinarySerDe, read one row from the output sequence file,
       * since it is a blob of row batches.
       */
      if (fetchTask.getWork().isUsingThriftJDBCBinarySerDe()) {
        maxRows = 1;
      }
      fetchTask.setMaxRows(maxRows);
      return fetchTask.fetch(res);
    }
...

Driver.run calls runInternal, which first calls compileInternal to compile the SQL and build a QueryPlan, then calls execute to run all the tasks in the QueryPlan;
Driver.getResults calls FetchTask.fetch to retrieve the results;
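
The scheduling loop in execute can be modeled as a walk over a task DAG: root tasks are queued first, and a child is queued once it becomes launchable. The sketch below is single-threaded and hypothetical (TaskDagSketch and its Task class are not Hive's); it assumes that "launchable" means all parents have finished, which the excerpt does not spell out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical single-threaded model of the execute loop; not Hive's Driver.
final class TaskDagSketch {
    static final class Task {
        final String id;
        final List<Task> parents = new ArrayList<>();
        final List<Task> children = new ArrayList<>();
        Task(String id) { this.id = id; }
        void addChild(Task child) {
            children.add(child);
            child.parents.add(this);
        }
    }

    // Runs root tasks first; a child is queued once all of its parents are done.
    static List<String> execute(List<Task> rootTasks) {
        Deque<Task> runnable = new ArrayDeque<>(rootTasks);
        Set<Task> queued = new HashSet<>(rootTasks);
        Set<Task> done = new HashSet<>();
        List<String> order = new ArrayList<>();
        while (!runnable.isEmpty()) {
            Task task = runnable.poll();
            order.add(task.id);  // stands in for launching the task and polling it as finished
            done.add(task);
            for (Task child : task.children) {
                // Assumption: launchable means every parent has completed.
                if (!queued.contains(child) && done.containsAll(child.parents)) {
                    queued.add(child);
                    runnable.add(child);
                }
            }
        }
        return order;
    }
}
```

For a diamond-shaped plan (a feeds b and c, both feed d), d is queued only after both b and c have finished, so it runs exactly once and last.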

Both steps involve a lot of detail and will be covered in separate posts.
