Packaging a Spark Maven Project and Running It with spark-submit
By 阿新 • Published: 2019-01-08
- Project directory name: countjpgs
- pom.xml (located in the project root)
- Source file: countjpgs => src => main => scala => stubs => CountJPGs.scala
- The weblogs data lives under /loudacre on HDFS; it is a set of web log files containing requests of various kinds.
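The project skeleton described in the list above can be created with a few commands (a minimal sketch; the directory and file names are taken from the list, and the files are created empty here):

```shell
# Create the Maven/Scala source layout for the countjpgs project
mkdir -p countjpgs/src/main/scala/stubs
touch countjpgs/pom.xml
touch countjpgs/src/main/scala/stubs/CountJPGs.scala

# Show the resulting layout
find countjpgs -type f
```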
Contents of pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.training.dev1</groupId>
  <artifactId>countjpgs</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>"Count JPGs"</name>

  <properties>
    <spark-assembly>/usr/lib/spark/lib/spark-assembly.jar</spark-assembly>
    <hadoop-mapreduce-client-common>/usr/lib/hadoop/client/hadoop-mapreduce-client-common.jar</hadoop-mapreduce-client-common>
    <hadoop-mapreduce-client-core>/usr/lib/hadoop/client/hadoop-mapreduce-client-core.jar</hadoop-mapreduce-client-core>
    <hadoop-common>/usr/lib/hadoop/client/hadoop-common.jar</hadoop-common>
    <avro>/usr/lib/hadoop/client/avro.jar</avro>
    <commons-lang>/usr/lib/hadoop/client/commons-lang.jar</commons-lang>
    <guava>/usr/lib/hadoop/client/guava.jar</guava>
    <slf4j-api>/usr/lib/hadoop/client/slf4j-api.jar</slf4j-api>
    <slf4j-log4j12>/usr/lib/hadoop/client/slf4j-log4j12.jar</slf4j-log4j12>
    <hadoop-annotations>/usr/lib/hadoop/client/hadoop-annotations.jar</hadoop-annotations>
  </properties>

  <repositories>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.5.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.5</version>
      <scope>system</scope>
      <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-common}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-mapreduce-client-common}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-annotations</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-annotations}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>avro</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${avro}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${slf4j-log4j12}</systemPath>
    </dependency>
  </dependencies>
</project>
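Note that this pom wires Spark and Hadoop in through system-scoped dependencies pointing at jars installed on the local CDH machine, so it only builds on a host where those paths exist. A more portable alternative (a sketch only; the version number is an assumption and must match the Spark version on your cluster) is a regular provided-scope dependency, resolved from the Cloudera repository already declared above:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <!-- Assumed version; set this to the Spark version your cluster runs -->
  <version>1.6.0</version>
  <!-- provided: the cluster supplies Spark at runtime, so it is not bundled in the jar -->
  <scope>provided</scope>
</dependency>
```

With this style the build no longer depends on local jar paths, and `mvn package` works on any machine with network access to the repositories.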
Contents of CountJPGs.scala:
package stubs

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object CountJPGs {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: CountJPGs <file>")
      System.exit(1)
    }

    val sc = new SparkContext()
    val logfile = args(0)
    val weblogs = sc.textFile(logfile)
    // Keep only the log lines that request a .jpg file
    val weblogsJpg = weblogs.filter(_.contains(".jpg"))
    val weblogsJpgCount = weblogsJpg.count()
    println("JPG Count : " + weblogsJpgCount)
    sc.stop()
  }
}
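Before submitting, the filter logic can be sanity-checked outside Spark: the job counts lines containing ".jpg", which is exactly what `grep -c` computes on a local file. A quick sketch (the sample log lines below are hypothetical, for illustration only):

```shell
# Create a tiny sample web log (hypothetical lines, for illustration only)
cat > sample_weblog.txt <<'EOF'
116.180.70.237 - 128 [15/Sep/2013:23:59:53] "GET /KBDOC-00031.html HTTP/1.0"
116.180.70.237 - 128 [15/Sep/2013:23:59:53] "GET /theme.css HTTP/1.0"
218.193.16.244 - 94 [15/Sep/2013:23:58:45] "GET /titanic_1100.jpg HTTP/1.0"
EOF

# Same predicate as weblogs.filter(_.contains(".jpg")) followed by count()
grep -c '\.jpg' sample_weblog.txt   # prints 1
```

If the count grep reports on a sample matches what the Spark job reports on the same file, the filter behaves as expected.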
Change into the project root directory, countjpgs:
$ cd <project-path>/countjpgs
Build the package:
$ mvn package
Once the build succeeds, the jar is generated under the target directory, with a name based on the project name (here, countjpgs-1.0.jar).
Again from the project root directory, countjpgs:
$ cd <project-path>/countjpgs
Run the program with spark-submit:
$ spark-submit --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*
Output: the job prints a line of the form "JPG Count : <count>".
Supplement: the command to submit the job to a YARN cluster:
$ spark-submit --class stubs.CountJPGs --master yarn-client --name 'Count JPGs' target/countjpgs-1.0.jar /loudacre/weblogs/*
Alternatively, you can create a properties file in the project root and have spark-submit load the settings from it:
$ vim myspark.conf
Contents of this file:
spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M
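Writing the file can also be scripted instead of using vim; a minimal sketch that creates the three settings shown above and confirms them:

```shell
# Write the spark-submit properties file described above
# (keys and values are whitespace-separated, one setting per line)
cat > myspark.conf <<'EOF'
spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M
EOF

# Confirm the master setting was written
grep 'spark.master' myspark.conf
```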
Launch command:
$ spark-submit --properties-file myspark.conf --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*
You can then see the corresponding configuration in the YARN web UI.