
Spark SQL Execution Process

1. Code implementation

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// The case class must be defined outside the object
case class Person(id: Int, name: String, age: Int)

object InferringSchema {
  def main(args: Array[String]) {
    // Create the SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-1")
    // SQLContext depends on a SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the input path and split each line on commas
    val lineRDD = sc.textFile(args(0)).map(_.split(","))
    // Map each record onto the Person case class
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    // Import the implicit conversions; without them the RDD cannot be converted to a DataFrame
    import sqlContext.implicits._
    // Convert the RDD to a DataFrame
    val personDF = personRDD.toDF
    // Register a temporary table
    personDF.registerTempTable("t_person")
    // Run the SQL
    val df = sqlContext.sql("select * from t_person order by age desc limit 2")
    // Save the result as JSON to the output path
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}
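The program expects each input line to be a comma-separated id,name,age record. As a minimal illustration (the file name and contents below are made up for this example), an input file person.txt such as

1,laoduan,30
2,laoyang,29
3,laozhang,28
4,laoli,25

would make the query return the two oldest people, and df.write.json(args(1)) would write them as line-delimited JSON along the lines of

{"id":1,"name":"laoduan","age":30}
{"id":2,"name":"laoyang","age":29}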
2. Submit command

./spark-submit  --class InferringSchema --master spark://hadoop1:7077,hadoop2:7077 ../sparkJar/sparkSql.jar  hdfs://hadoop1:9000/sparkSql  hdfs://hadoop1:9000/spark
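Before submitting, the input data must already exist in HDFS, and the output directory must not exist yet (df.write.json uses the default SaveMode.ErrorIfExists, so an existing target path makes the job fail). A possible preparation step, assuming the hypothetical person.txt from above:

hdfs dfs -put person.txt hdfs://hadoop1:9000/sparkSql
# remove any previous output; -f suppresses the error if the path does not exist
hdfs dfs -rm -r -f hdfs://hadoop1:9000/spark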

3. Viewing the execution process (driver log)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/04 23:42:52 INFO SparkContext: Running Spark version 1.6.2
16/11/04 23:42:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/04 23:42:53 INFO SecurityManager: Changing view acls to: root
16/11/04 23:42:53 INFO SecurityManager: Changing modify acls to: root

16/11/04 23:42:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/11/04 23:42:55 INFO Utils: Successfully started service 'sparkDriver' on port 51933.
16/11/04 23:42:56 INFO Slf4jLogger: Slf4jLogger started
16/11/04 23:42:56 INFO Remoting: Starting remoting
16/11/04 23:42:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.215.133:44880]
16/11/04 23:42:56 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44880.
16/11/04 23:42:56 INFO SparkEnv: Registering MapOutputTracker
16/11/04 23:42:56 INFO SparkEnv: Registering BlockManagerMaster
16/11/04 23:42:56 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-c05d1a76-3d2b-4912-b371-7fd8e747445c
16/11/04 23:42:56 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/11/04 23:42:57 INFO SparkEnv: Registering OutputCommitCoordinator
16/11/04 23:42:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/11/04 23:42:57 INFO SparkUI: Started SparkUI at http://192.168.215.133:4040

16/11/04 23:42:57 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4b2b30f5-6bcb-485c-b619-44a9015257e8/httpd-b4a23a5e-0127-406c-888f-a44d7b4a17ac
16/11/04 23:42:58 INFO HttpServer: Starting HTTP Server
16/11/04 23:42:58 INFO Utils: Successfully started service 'HTTP file server' on port 48515.
16/11/04 23:42:58 INFO SparkContext: Added JAR file:/usr/local/spark/bin/../sparkJar/sparkSql.jar at http://192.168.215.133:48515/jars/sparkSql.jar with timestamp 1478328178593
16/11/04 23:42:58 INFO AppClient$ClientEndpoint: Connecting to master spark://hadoop1:7077...
16/11/04 23:42:58 INFO AppClient$ClientEndpoint: Connecting to master spark://hadoop2:7077...

16/11/04 23:42:59 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20161104234259-0008
16/11/04 23:42:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54971.
16/11/04 23:42:59 INFO NettyBlockTransferService: Server created on 54971
16/11/04 23:42:59 INFO AppClient$ClientEndpoint: Executor added: app-20161104234259-0008/0 on worker-20161104194431-192.168.215.132-59350 (192.168.215.132:59350) with 1 cores
16/11/04 23:42:59 INFO SparkDeploySchedulerBackend: Granted executor ID app-20161104234259-0008/0 on hostPort 192.168.215.132:59350 with 1 cores, 1024.0 MB RAM
16/11/04 23:42:59 INFO BlockManagerMaster: Trying to register BlockManager
16/11/04 23:42:59 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.215.133:54971 with 517.4 MB RAM, BlockManagerId(driver, 192.168.215.133, 54971)
16/11/04 23:42:59 INFO BlockManagerMaster: Registered BlockManager
16/11/04 23:42:59 INFO AppClient$ClientEndpoint: Executor updated: app-20161104234259-0008/0 is now RUNNING
16/11/04 23:42:59 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/11/04 23:43:02 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
16/11/04 23:43:02 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)
16/11/04 23:43:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.215.133:54971 (size: 13.9 KB, free: 517.4 MB)
16/11/04 23:43:02 INFO SparkContext: Created broadcast 0 from textFile at SQLDemo1.scala:17
16/11/04 23:43:08 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (hadoop3:52754) with ID 0
16/11/04 23:43:08 INFO BlockManagerMasterEndpoint: Registering block manager hadoop3:39995 with 517.4 MB RAM, BlockManagerId(0, hadoop3, 39995)
16/11/04 23:43:14 INFO FileInputFormat: Total input paths to process : 1
16/11/04 23:43:14 INFO SparkContext: Starting job: json at SQLDemo1.scala:30
16/11/04 23:43:14 INFO DAGScheduler: Got job 0 (json at SQLDemo1.scala:30) with 2 output partitions
16/11/04 23:43:14 INFO DAGScheduler: Final stage: ResultStage 0 (json at SQLDemo1.scala:30)
16/11/04 23:43:14 INFO DAGScheduler: Parents of final stage: List()
16/11/04 23:43:14 INFO DAGScheduler: Missing parents: List()
16/11/04 23:43:14 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[8] at json at SQLDemo1.scala:30), which has no missing parents
16/11/04 23:43:14 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.9 KB, free 175.4 KB)
16/11/04 23:43:14 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 179.5 KB)
16/11/04 23:43:14 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.215.133:54971 (size: 4.1 KB, free: 517.4 MB)
16/11/04 23:43:14 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/11/04 23:43:14 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[8] at json at SQLDemo1.scala:30)
16/11/04 23:43:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/11/04 23:43:14 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop3, partition 0,NODE_LOCAL, 2199 bytes)
16/11/04 23:43:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop3:39995 (size: 4.1 KB, free: 517.4 MB)
16/11/04 23:43:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop3:39995 (size: 13.9 KB, free: 517.4 MB)
16/11/04 23:43:30 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop3, partition 1,NODE_LOCAL, 2199 bytes)
16/11/04 23:43:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 15642 ms on hadoop3 (1/2)
16/11/04 23:43:30 INFO DAGScheduler: ResultStage 0 (json at SQLDemo1.scala:30) finished in 15.695 s
16/11/04 23:43:30 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 116 ms on hadoop3 (2/2)
16/11/04 23:43:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/11/04 23:43:30 INFO DAGScheduler: Job 0 finished: json at SQLDemo1.scala:30, took 15.899557 s
16/11/04 23:43:30 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/11/04 23:43:30 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/11/04 23:43:30 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/11/04 23:43:30 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/11/04 23:43:30 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/11/04 23:43:30 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/11/04 23:43:30 INFO SparkContext: Starting job: json at SQLDemo1.scala:30
16/11/04 23:43:30 INFO DAGScheduler: Got job 1 (json at SQLDemo1.scala:30) with 1 output partitions
16/11/04 23:43:30 INFO DAGScheduler: Final stage: ResultStage 1 (json at SQLDemo1.scala:30)
16/11/04 23:43:30 INFO DAGScheduler: Parents of final stage: List()
16/11/04 23:43:30 INFO DAGScheduler: Missing parents: List()
16/11/04 23:43:30 INFO DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[9] at json at SQLDemo1.scala:30), which has no missing parents
16/11/04 23:43:30 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 63.2 KB, free 242.7 KB)
16/11/04 23:43:30 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.5 KB, free 264.2 KB)
16/11/04 23:43:30 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.215.133:54971 (size: 21.5 KB, free: 517.4 MB)
16/11/04 23:43:30 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/11/04 23:43:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ParallelCollectionRDD[9] at json at SQLDemo1.scala:30)
16/11/04 23:43:30 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/11/04 23:43:30 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, hadoop3, partition 0,PROCESS_LOCAL, 2537 bytes)
16/11/04 23:43:30 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop3:39995 (size: 21.5 KB, free: 517.4 MB)
16/11/04 23:43:31 INFO DAGScheduler: ResultStage 1 (json at SQLDemo1.scala:30) finished in 1.024 s
16/11/04 23:43:31 INFO DAGScheduler: Job 1 finished: json at SQLDemo1.scala:30, took 1.175741 s
16/11/04 23:43:31 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1026 ms on hadoop3 (1/1)
16/11/04 23:43:31 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/11/04 23:43:32 INFO ContextCleaner: Cleaned accumulator 3
16/11/04 23:43:32 INFO BlockManagerInfo: Removed broadcast_2_piece0 on hadoop3:39995 in memory (size: 21.5 KB, free: 517.4 MB)
16/11/04 23:43:32 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.215.133:54971 in memory (size: 21.5 KB, free: 517.4 MB)
16/11/04 23:43:32 INFO ContextCleaner: Cleaned accumulator 4
16/11/04 23:43:32 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.215.133:54971 in memory (size: 4.1 KB, free: 517.4 MB)
16/11/04 23:43:32 INFO BlockManagerInfo: Removed broadcast_1_piece0 on hadoop3:39995 in memory (size: 4.1 KB, free: 517.4 MB)
16/11/04 23:43:32 INFO DefaultWriterContainer: Job job_201611042343_0000 committed.
16/11/04 23:43:32 INFO JSONRelation: Listing hdfs://hadoop1:9000/sparktest4 on driver
16/11/04 23:43:32 INFO JSONRelation: Listing hdfs://hadoop1:9000/sparktest4 on driver

16/11/04 23:43:32 INFO SparkUI: Stopped Spark web UI at http://192.168.215.133:4040
16/11/04 23:43:32 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/11/04 23:43:32 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/11/04 23:43:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/04 23:43:32 INFO MemoryStore: MemoryStore cleared
16/11/04 23:43:32 INFO BlockManager: BlockManager stopped
16/11/04 23:43:32 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/04 23:43:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/04 23:43:32 INFO SparkContext: Successfully stopped SparkContext
16/11/04 23:43:32 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/11/04 23:43:32 INFO ShutdownHookManager: Shutdown hook called
16/11/04 23:43:32 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/11/04 23:43:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-4b2b30f5-6bcb-485c-b619-44a9015257e8
16/11/04 23:43:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-4b2b30f5-6bcb-485c-b619-44a9015257e8/httpd-b4a23a5e-0127-406c-888f-a44d7b4a17ac
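The single df.write.json call shows up in the driver log as two jobs: job 0 reads the two input splits on the executor (hadoop3) and writes the JSON part files through FileOutputCommitter, while job 1 is a much smaller follow-up job that the JSON data source runs after the write (the subsequent "Listing ... on driver" lines belong to that same post-write step). Once the application finishes, the result can be checked directly in HDFS; a possible check (part file names will vary):

hdfs dfs -ls hdfs://hadoop1:9000/spark
hdfs dfs -cat hdfs://hadoop1:9000/spark/part-*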

4. Viewing the execution process in the web UI




