Big Data Real-Time Computing: Spark Learning Notes (1) - Spark Word Count
Posted by 阿新 on 2018-12-30
1 Starting the Spark Shell
[[email protected] ~]$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/12/26 17:10:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/26 17:10:21 WARN SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1'). This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
Spark context Web UI available at http://192.168.30.131:4040
Spark context available as 'sc' (master = local[*], app id = local-1545815422338).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.3
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
1.1 sc
SparkContext
: the entry point of a Spark program; it encapsulates the information of the whole Spark runtime environment.
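The startup banner above already provides sc; a quick sanity check in the shell (the res numbering below is just whatever the REPL assigns next):
scala> sc.master
res0: String = local[*]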
1.2 Counting Words
Input file: words.txt
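The note only names the input file here; a minimal shell version of the count, using the same flatMap/map/reduceByKey chain as the IDEA program in section 2.2 (the /home/hadoop path is an assumption):
scala> val lines = sc.textFile("file:///home/hadoop/words.txt")
scala> val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
scala> counts.collect.foreach(println)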
1.3 Filtering Words
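This section is empty in the note; one plausible sketch is to filter the tokens before counting, for example dropping empty strings or keeping a single word ("hello" below is just a placeholder):
scala> val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
scala> words.filter(_ == "hello").count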
2 Creating a Spark Application with Maven in IDEA
2.1 Modifying the POM File
- Set the Scala and Spark versions
<properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
</properties>
- Add the Cloudera (CDH) repository
<!-- Add the Cloudera repository -->
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
</repositories>
- Add the dependencies
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
2.2 Programming with the Spark API
SparkContext
: the main entry point for Spark functionality; it represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables. Only one SparkContext may be active per JVM.
SparkConf
: the Spark configuration object; it sets the parameters of a Spark application as key-value pairs.
package com.bigdataSpark.cn

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    // Run locally with 2 threads; the app name shows up in the Web UI
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountDemo")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("d:/words.txt")    // one element per line
    val rdd2 = rdd1.flatMap(_.split(" "))     // split each line into words
    val rdd3 = rdd2.map((_, 1))               // word -> (word, 1)
    val rdd4 = rdd3.reduceByKey(_ + _)        // sum the counts per word
    val r = rdd4.collect
    r.foreach(println)
    sc.stop()
  }
}
2.3 Word Count with the Java Spark API
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class WordCountJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCountJava");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> rdd1 = sc.textFile("d:/words.txt");
        // Split each line into words
        JavaRDD<String> rdd2 = rdd1.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                List<String> list = new ArrayList<String>();
                String[] arr = s.split(" ");
                for (String ss : arr) {
                    list.add(ss);
                }
                return list.iterator();
            }
        });
        // Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> rdd3 = rdd2.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        // Sum the counts for each word
        JavaPairRDD<String, Integer> rdd4 = rdd3.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        List<Tuple2<String, Integer>> list = rdd4.collect();
        for (Tuple2<String, Integer> t : list) {
            System.out.println(t._1 + ":" + t._2);
        }
        sc.stop();
    }
}
2.4 Packaging and Running on the Cluster
- Modify the Scala word-count source: remove the hard-coded master and app name, and read the input path from the program arguments
package com.bigdataSpark.cn

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    // Master and app name are now supplied by spark-submit
    val conf = new SparkConf() //.setMaster("local[2]").setAppName("WordCountDemo")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile(args(0))   // input path comes from the first program argument
    val rdd2 = rdd1.flatMap(_.split(" "))
    val rdd3 = rdd2.map((_, 1))
    val rdd4 = rdd3.reduceByKey(_ + _)
    val r = rdd4.collect
    r.foreach(println)
    sc.stop()
  }
}
- Then package the project with Maven and upload the jar to the cluster
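The packaging command itself is not shown in the note; with a standard Maven layout it would typically be mvn clean package, which (assuming the artifactId is myspark and the version 1.0) produces target/myspark-1.0.jar, the jar used in the submit command at the end of this section.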
spark-submit
[[email protected] ~]$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
[[email protected] ~]$
spark-submit --master local[2] --name WordCountScala --class com.bigdataSpark.cn.WordCountDemo /home/hadoop/myspark-1.0.jar /home/hadoop/words.txt
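For reference: --class names the fully qualified object built above, --master local[2] runs the job locally with two threads (on a real cluster this would be a spark://, yarn, or mesos:// master URL from the help text), and the trailing /home/hadoop/words.txt is passed to the program as args(0).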