Spark Source Code Analysis: Spark Shell (Part 1)

阿新 • Published: 2018-10-31

https://www.cnblogs.com/xing901022/p/6412619.html

The Spark version analyzed in this article is Apache's spark-2.1.0-bin-hadoop2.7.

Layout of the bin directory:

-rwxr-xr-x. 1 bigdata bigdata 1089 Dec 15  2016 beeline
-rw-r--r--. 1 bigdata bigdata  899 Dec 15  2016 beeline.cmd
-rw-rw-r--. 1 bigdata bigdata  776 Sep 18 06:27 derby.log
-rwxr-xr-x. 1 bigdata bigdata 1933 Dec 15  2016 find-spark-home
-rw-r--r--. 1 bigdata bigdata 1909 Dec 15  2016 load-spark-env.cmd
-rw-r--r--. 1 bigdata bigdata 2133 Dec 15  2016 load-spark-env.sh
drwxrwxr-x. 5 bigdata bigdata 4096 Sep 18 06:27 metastore_db
-rwxr-xr-x. 1 bigdata bigdata 2989 Dec 15  2016 pyspark
-rw-r--r--. 1 bigdata bigdata 1493 Dec 15  2016 pyspark2.cmd
-rw-r--r--. 1 bigdata bigdata 1002 Dec 15  2016 pyspark.cmd
-rwxr-xr-x. 1 bigdata bigdata 1030 Dec 15  2016 run-example
-rw-r--r--. 1 bigdata bigdata  988 Dec 15  2016 run-example.cmd
-rwxr-xr-x. 1 bigdata bigdata 3116 Dec 15  2016 spark-class
-rw-r--r--. 1 bigdata bigdata 2236 Dec 15  2016 spark-class2.cmd
-rw-r--r--. 1 bigdata bigdata 1012 Dec 15  2016 spark-class.cmd
-rwxr-xr-x. 1 bigdata bigdata 1039 Dec 15  2016 sparkR
-rw-r--r--. 1 bigdata bigdata 1014 Dec 15  2016 sparkR2.cmd
-rw-r--r--. 1 bigdata bigdata 1000 Dec 15  2016 sparkR.cmd
-rwxr-xr-x. 1 bigdata bigdata 3017 Dec 15  2016 spark-shell
-rw-r--r--. 1 bigdata bigdata 1530 Dec 15  2016 spark-shell2.cmd
-rw-r--r--. 1 bigdata bigdata 1010 Dec 15  2016 spark-shell.cmd
-rwxr-xr-x. 1 bigdata bigdata 1065 Dec 15  2016 spark-sql
-rwxr-xr-x. 1 bigdata bigdata 1040 Dec 15  2016 spark-submit
-rw-r--r--. 1 bigdata bigdata 1128 Dec 15  2016 spark-submit2.cmd
-rw-r--r--. 1 bigdata bigdata 1012 Dec 15  2016 spark-submit.cmd

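First, what is spark-shell?

spark-shell is a command window for real-time interaction: you write Spark code in it, and each command is evaluated as soon as you enter it. This kind of tool is also called a REPL (Read-Eval-Print Loop), an interactive development environment.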
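To make the read-eval-print idea concrete, here is a minimal sketch of a toy REPL in bash. It only evaluates integer arithmetic, and the name toy_repl is made up for this illustration; it has nothing to do with how the Scala REPL behind spark-shell is implemented, and it should not be fed untrusted input.

```shell
#!/usr/bin/env bash

# Toy read-eval-print loop for integer arithmetic (illustration only).
# toy_repl is a hypothetical name; it is not part of Spark.
toy_repl() {
  local line
  # Read: take one line of input at a time until end of input.
  while IFS= read -r line; do
    # Leave the loop when the user types :quit.
    if [ "$line" = ":quit" ]; then
      break
    fi
    # Eval + Print: evaluate the line as an arithmetic expression and print it.
    echo "$(( line ))"
  done
}

# Example session fed from a pipe instead of a keyboard.
printf '1 + 2\n6 * 7\n:quit\n' | toy_repl
```

The loop body is the whole pattern: read one input, evaluate it, print the result, repeat.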

Let's take a quick look first; there really isn't much code:

#!/usr/bin/env bash

# Shell script for starting the Spark Shell REPL

cygwin=false
case "$(uname)" in
  CYGWIN*) cygwin=true;;
esac

# Enter posix mode for bash
set -o posix

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"

# SPARK-4161: scala does not assume use of the java classpath,
# so we need to add the "-Dscala.usejavacp=true" flag manually. We
# do this specifically for the Spark shell because the scala REPL
# has its own class loader, and any additional classpath specified
# through spark.driver.extraClassPath is not automatically propagated.
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

# Copy restore-TTY-on-exit functions from Scala script so spark-shell exits properly even in
# binary distribution of Spark where Scala is not installed
exit_status=127
saved_stty=""

# restore stty settings (echo in particular)
function restoreSttySettings() {
  stty $saved_stty
  saved_stty=""
}

function onExit() {
  if [[ "$saved_stty" != "" ]]; then
    restoreSttySettings
  fi
  exit $exit_status
}

# to reenable echo if we are interrupted before completing.
trap onExit INT

# save terminal settings
saved_stty=$(stty -g 2>/dev/null)
# clear on error so we don't later try to restore them
if [[ ! $? ]]; then
  saved_stty=""
fi

main "$@"

# record the exit status lest it be overwritten:
# then reenable echo and propagate the code.
exit_status=$?
onExit

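All this script really shows is that it calls spark-submit. A later post will analyze what spark-submit does (it in turn calls spark-class, which is what finally launches the program; everything before that only passes arguments along).

The very first part,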

cygwin=false
case "$(uname)" in
  CYGWIN*) cygwin=true;;
esac

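appears in many startup scripts. It checks whether your system is Cygwin, using the uname command, which is typically used to query the system name or the kernel version.

uname prints the name of the operating system; see man uname for details. Running uname with no options usually prints Linux; uname -r prints the kernel release; uname -a prints all available information.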
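The same case pattern can be extended to other platforms. The sketch below is only an illustration: detect_platform is a hypothetical helper, and the Linux and Darwin branches are additions that the spark-shell script itself does not contain.

```shell
#!/usr/bin/env bash

# Classify a kernel name as reported by `uname -s`.
# detect_platform is a hypothetical helper, not part of Spark.
detect_platform() {
  case "$1" in
    CYGWIN*) echo "cygwin" ;;   # e.g. CYGWIN_NT-10.0
    Linux)   echo "linux" ;;
    Darwin)  echo "macos" ;;
    *)       echo "other" ;;
  esac
}

# Classify the machine this script is running on.
detect_platform "$(uname -s)"
```

Passing the name in as an argument, instead of calling uname inside the function, keeps the matching logic easy to try out with any string.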

set -o posix

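This puts the shell into POSIX mode; some commands and operations behave differently depending on the mode. POSIX (Portable Operating System Interface) is a family of standards that defines a common operating-system interface for Unix-like systems.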
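A small sketch of how the mode can be switched and inspected in bash. The helper name posix_mode_status is made up for this example; the [[ -o posix ]] test is a bash feature, so this assumes the script runs under bash.

```shell
#!/usr/bin/env bash

# Report whether the current bash shell is in POSIX mode.
# posix_mode_status is a hypothetical helper name.
posix_mode_status() {
  if [[ -o posix ]]; then
    echo "on"
  else
    echo "off"
  fi
}

set +o posix          # make sure we start from bash's default mode
posix_mode_status
set -o posix          # the same command spark-shell runs
posix_mode_status
set +o posix          # switch back to the default mode
```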

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

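The first if statement, if [ -z "${SPARK_HOME}" ]; then, checks whether the SPARK_HOME environment variable has been set (more precisely, whether it is unset or empty).

Shell conditional expressions come in many forms, for example: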

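# File expressions
if [ -f file ]    true if file exists and is a regular file
if [ -d ...  ]    true if the directory exists
if [ -s file ]    true if file exists and is not empty
if [ -r file ]    true if file exists and is readable
if [ -w file ]    true if file exists and is writable
if [ -x file ]    true if file exists and is executable

# Integer expressions
if [ int1 -eq int2 ]    true if int1 equals int2
if [ int1 -ne int2 ]    true if int1 does not equal int2
if [ int1 -ge int2 ]    true if >=
if [ int1 -gt int2 ]    true if >
if [ int1 -le int2 ]    true if <=
if [ int1 -lt int2 ]    true if <

# String expressions
if [ "$string1" = "$string2" ]     true if string1 equals string2 (for strings, a single = is the equality operator)
if [ "$string1" != "$string2" ]    true if string1 does not equal string2
if [ -n "$string" ]                true if string is non-empty (exit status 0)
if [ -z "$string" ]                true if string is empty
if [ "$string" ]                   true if string is non-empty (same as -n)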

So the test above simply checks whether ${SPARK_HOME} is empty.
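A few of the tests from the list above, combined into a runnable sketch. The helper names describe_path and value_or_default, and the path /opt/spark, are invented for this illustration. Note the double quotes around every variable: without them, an empty value disappears from the command line and the test can give the wrong answer.

```shell
#!/usr/bin/env bash

# Classify a path using the file tests -d, -f and -s.
# describe_path is a hypothetical helper name.
describe_path() {
  if [ -d "$1" ]; then
    echo "directory"
  elif [ -f "$1" ] && [ -s "$1" ]; then
    echo "non-empty file"
  elif [ -f "$1" ]; then
    echo "empty file"
  else
    echo "missing"
  fi
}

# Print the first argument, or the second one when the first is empty.
# This mirrors the -z check that spark-shell applies to SPARK_HOME.
value_or_default() {
  if [ -z "$1" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}

# Build a scratch directory with one empty and one non-empty file.
demo_dir="$(mktemp -d)"
: > "$demo_dir/empty.txt"
echo "hello" > "$demo_dir/full.txt"

describe_path "$demo_dir"
describe_path "$demo_dir/empty.txt"
describe_path "$demo_dir/full.txt"
describe_path "$demo_dir/no-such-file"
value_or_default "" "/opt/spark"   # /opt/spark is a made-up fallback path

rm -rf "$demo_dir"
```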

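The source command runs another script inside the current shell, so anything that script defines (variables, functions) remains available afterwards.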

source "$(dirname "$0")"/find-spark-home

Taken as a whole, this line runs the find-spark-home script that sits in the same directory as the current script.
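The difference between sourcing a script and running it as a separate process can be seen in a short experiment. This is a sketch under assumptions: the variable DEMO_SPARK_HOME and the path /opt/demo-spark are invented for the demonstration, and a temporary file stands in for find-spark-home.

```shell
#!/usr/bin/env bash

# Create a throwaway helper script that sets one variable.
helper="$(mktemp)"
echo 'DEMO_SPARK_HOME=/opt/demo-spark' > "$helper"

unset DEMO_SPARK_HOME

# Running the helper in a child process: the assignment is lost when it exits.
bash "$helper"
echo "after bash:   ${DEMO_SPARK_HOME:-<unset>}"

# Sourcing the helper: the assignment happens in the current shell and persists.
source "$helper"
echo "after source: ${DEMO_SPARK_HOME:-<unset>}"

rm -f "$helper"
```

This is why a launcher script sources a helper whose job is to set a variable such as SPARK_HOME: executed as a child process, the helper could not change the caller's environment.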

Let's look at it in detail:

First, $0 is one of the shell's special variables. There are many similar ones:

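$# is the number of arguments passed to the script
$0 is the name of the script itself
$1 is the first argument passed to the script
$2 is the second argument passed to the script
$@ is the list of all arguments passed to the script
$* is all the arguments passed to the script, shown as a single string; unlike the positional variables, it can cover more than 9 arguments
$$ is the process ID of the running script
$? is the exit status of the last command: 0 means no error, anything else means an error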

The most commonly used are probably $0 and $@.
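The practical difference between $@ and $* shows up when arguments contain spaces. The sketch below uses made-up helper functions (count_args and the three forward_* wrappers) and assumes the default IFS; it counts how many arguments arrive after forwarding.

```shell
#!/usr/bin/env bash

# Print how many arguments were received.
count_args() {
  echo "$#"
}

# Three hypothetical wrappers that forward their arguments in different ways.
forward_quoted_at() { count_args "$@"; }   # each argument stays intact
forward_quoted_star() { count_args "$*"; } # everything joined into one string
forward_unquoted() { count_args $@; }      # re-split on whitespace

forward_quoted_at "a b" c     # prints 2
forward_quoted_star "a b" c   # prints 1
forward_unquoted "a b" c      # prints 3
```

Only the quoted "$@" form hands the arguments on unchanged, which is why launcher scripts usually forward their arguments that way.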

Next, the dirname command. It prints the directory part of a path. For example, given a file /home/xinghl/test/test1, running dirname test1 inside the test directory returns:

# pwd
/home/xinghl/test
# ll
total 4
-rw-r--r-- 1 root root 27 Feb 17 10:48 test1
# dirname test1
.

What we want is exactly that dot. In Linux, . stands for the current directory and .. for the parent directory.
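Because dirname works purely on the text of the path, a relative $0 yields a relative directory such as the dot. A common way to turn that into an absolute directory is to cd into it in a subshell and print the working directory. The function name script_dir below is an assumption made for this sketch; spark-shell itself only uses the plain dirname form.

```shell
#!/usr/bin/env bash

dirname /home/xinghl/test/test1   # directory part of an absolute path
dirname test1                     # no slash in the argument, so the result is .

# Resolve the directory of a path to an absolute directory.
# script_dir is a hypothetical helper; the subshell keeps the caller's
# working directory unchanged.
script_dir() {
  (cd -- "$(dirname -- "$1")" && pwd)
}

script_dir "$0"
```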

SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

Because Scala does not use the Java classpath by default, the flag has to be added manually here so that the Scala REPL uses the Java classpath.
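Appending the flag is only half of the job: the variable also has to be exported before a child process such as spark-submit can see it. The sketch below demonstrates this with a stand-in variable, DEMO_SUBMIT_OPTS, and a hypothetical helper, child_view, so that it does not touch a real Spark setup.

```shell
#!/usr/bin/env bash

unset DEMO_SUBMIT_OPTS

# Print the variable as a child process sees it.
# child_view is a hypothetical helper name.
child_view() {
  bash -c 'echo "${DEMO_SUBMIT_OPTS:-<not visible>}"'
}

# Append a flag to whatever the variable already holds (here: nothing).
DEMO_SUBMIT_OPTS="$DEMO_SUBMIT_OPTS -Dscala.usejavacp=true"
child_view    # the variable is not exported yet, so the child cannot see it

export DEMO_SUBMIT_OPTS
child_view    # now the child process sees the appended flag
```

When the variable starts out empty, the appended value keeps a leading space; that is harmless for a list of JVM options.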

That's all for now. A later post will cover how the spark-shell window works.

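Author: xingoo

Source: http://www.cnblogs.com/xing901022

Copyright of this article is shared by the author and cnblogs (部落格園). Reposting is welcome, but this notice must be kept and a prominent link to the original must be given on the article page!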
