Python Engineer's Road to Big Data (7a): Hadoop, ZooKeeper, HIVE, Spark Cluster Deployment
阿新 • Posted 2021-01-24
Tags: Big Data
A one-pass installation of JDK, Hadoop, ZooKeeper, HIVE, and Spark.
1. Environment

Component | Version | Download |
---|---|---|
CentOS | 7.5 | |
JDK | 1.8 | https://download.csdn.net/download/Yellow_python/13782524 |
Hadoop | 3.1.3 | same link as JDK |
HIVE | 3.1.2 | same link as JDK |
Spark | 3.0.1 | |
MySQL | 5.7.32 | https://dev.mysql.com/downloads/mysql/ |
MySQL JDBC driver | 5.1.49 | https://dev.mysql.com/downloads/connector/j/ |
ZooKeeper | 3.5.7 | https://mirrors.bfsu.edu.cn/apache/zookeeper/ |
Cluster plan | Service | hadoop100 | hadoop101 | hadoop102 |
---|---|---|---|---|
Hadoop(HDFS) | DataNode | 1 | 1 | 1 |
Hadoop(HDFS) | NameNode | 1 | 1 | |
Hadoop(ZKFC) | DFSZKFailoverController | 1 | 1 | |
Hadoop(HDFS) | JournalNode | 1 | 1 | 1 |
ZooKeeper | QuorumPeerMain | 1 | 1 | 1 |
Hadoop(YARN) | ResourceManager | 1 | | 1 |
Hadoop(YARN) | NodeManager | 1 | 1 | 1 |
Spark | | 1 | 1 | 1 |
HIVE | | 1 | | |
MySQL | | 1 | | |
First install the basic Linux tools. Among them, psmisc (which provides fuser) is a prerequisite for the sshfence method used later.
yum -y install psmisc net-tools vim rsync tree lrzsz
2. Network configuration and passwordless SSH
https://yellow520.blog.csdn.net/article/details/110143502
Once passwordless login is configured across the cluster, every command below is run on hadoop100; there is no need to switch between hosts.
3. Environment variables
https://yellow520.blog.csdn.net/article/details/112692486
4. MySQL installation
https://yellow520.blog.csdn.net/article/details/113036158
5. Extract Java, Hadoop, ZooKeeper, HIVE, and Spark
tar -zxf jdk-8u212-linux-x64.tar.gz -C /opt/
tar -zxf hadoop-3.1.3.tar.gz -C /opt/
tar -zxf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/
tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz -C /opt/
tar -zxf apache-hive-3.1.2-bin.tar.gz -C /opt/
cd /opt
mv jdk1.8.0_212 jdk
mv hadoop-3.1.3 hadoop
mv apache-zookeeper-3.5.7-bin zookeeper
mv spark-3.0.1-bin-hadoop2.7 spark
mv apache-hive-3.1.2-bin hive
chown -R root:root jdk hadoop zookeeper spark hive
ll
6. Configuration files
6.1. Hadoop configuration
Hadoop core configuration; add the following inside the <configuration> element:
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<!-- Root directory for Hadoop data -->
<property>
<name>hadoop.data.dir</name>
<value>/opt/hadoop/data</value>
</property>
<!-- High-availability settings below -->
<!-- Default path prefix used by Hadoop clients -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://myha</value>
</property>
<!-- ZK quorum used by ZKFC for automatic failover -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
</property>
<!-- ZK quorum address (used by the ResourceManagers) -->
<property>
<name>hadoop.zk.address</name>
<value>hadoop100:2181,hadoop101:2181,hadoop102:2181</value>
</property>
HDFS configuration; add the following inside the <configuration> element:
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<!-- Directory where the NameNode stores its data -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.data.dir}/name</value>
</property>
<!-- Directory where the DataNode stores its data -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.data.dir}/data</value>
</property>
<!-- High-availability settings below -->
<!-- Logical name of the nameservice -->
<property>
<name>dfs.nameservices</name>
<value>myha</value>
</property>
<!-- Unique identifier of each NameNode within the nameservice -->
<property>
<name>dfs.ha.namenodes.myha</name>
<value>nn0,nn1</value>
</property>
<!-- RPC address of each NameNode -->
<property>
<name>dfs.namenode.rpc-address.myha.nn0</name>
<value>hadoop100:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myha.nn1</name>
<value>hadoop101:8020</value>
</property>
<!-- HTTP address of each NameNode -->
<property>
<name>dfs.namenode.http-address.myha.nn0</name>
<value>hadoop100:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.myha.nn1</name>
<value>hadoop101:9870</value>
</property>
<!-- Storage for the NameNodes' shared edit log, i.e. the JournalNode list -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop100:8485;hadoop101:8485;hadoop102:8485/myha</value>
</property>
<!-- Java class DFS clients use to determine which NameNode is active -->
<property>
<name>dfs.client.failover.proxy.provider.myha</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method used to isolate the active NameNode during failover -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- SSH private key path for the sshfence method -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Local directory where JournalNodes store data (an absolute path is required) -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/hadoop/data/jn</value>
</property>
<!-- Enable automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
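The `${hadoop.data.dir}` placeholders above are expanded by Hadoop's configuration loader from the value set in core-site.xml. A rough stand-in for that interpolation (plain sed, not Hadoop code), showing what the NameNode directory resolves to:

```shell
# Simulate Hadoop's property substitution for dfs.namenode.name.dir.
hadoop_data_dir=/opt/hadoop/data   # value of hadoop.data.dir from core-site.xml
echo 'file://${hadoop.data.dir}/name' \
    | sed "s|\${hadoop.data.dir}|$hadoop_data_dir|"
# -> file:///opt/hadoop/data/name
```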
YARN configuration; add the following inside the <configuration> element:
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<!-- Must be mapreduce_shuffle for MapReduce to run -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Disable physical-memory checks -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Disable virtual-memory checks -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- High-availability settings below -->
<!-- Enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Logical ID of the RM cluster -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>rmha</value>
</property>
<!-- Logical IDs of the RM nodes -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm0,rm2</value>
</property>
<!-- Hostname of each RM node -->
<property>
<name>yarn.resourcemanager.hostname.rm0</name>
<value>hadoop100</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop102</value>
</property>
<!-- Web application address of each RM -->
<property>
<name>yarn.resourcemanager.webapp.address.rm0</name>
<value>hadoop100:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop102:8088</value>
</property>
MapReduce configuration; add the following inside the <configuration> element:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- Map container memory, in MB -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
<!-- Reduce container memory, in MB -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
<!-- Maximum JVM heap for map tasks -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx384m</value>
</property>
<!-- Maximum JVM heap for reduce tasks -->
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx768m</value>
</property>
<!-- The next three properties all set HADOOP_MAPRED_HOME -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
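The heap sizes above follow a common rule of thumb (a convention, not a Hadoop requirement): keep each task JVM's -Xmx at roughly 75% of its container memory, leaving headroom for non-heap usage. The 75% ratio reproduces the values in the config:

```shell
# 75% of each container size yields the -Xmx values used above.
map_container_mb=512
reduce_container_mb=1024
echo "map heap:    $((map_container_mb * 3 / 4))m"     # 384m
echo "reduce heap: $((reduce_container_mb * 3 / 4))m"  # 768m
```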
Worker list. echo's -e flag enables interpretation of backslash escapes, so each \n below becomes a real newline:
echo -e "hadoop100\nhadoop101\nhadoop102" > $HADOOP_HOME/etc/hadoop/workers
cat $HADOOP_HOME/etc/hadoop/workers
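A quick local check of the -e behavior, using a temporary file instead of the real workers file:

```shell
# Each \n becomes a newline, so the file ends up with one host per line.
tmp=$(mktemp)
echo -e "hadoop100\nhadoop101\nhadoop102" > "$tmp"
wc -l < "$tmp"          # 3
sed -n 2p "$tmp"        # hadoop101
rm -f "$tmp"
```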
6.2. ZooKeeper configuration
cd $ZOOKEEPER_HOME/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
# Data directory; an absolute path is recommended
dataDir=/opt/zookeeper/zkData
# Cluster server list
server.0=hadoop100:2888:3888
server.1=hadoop101:2888:3888
server.2=hadoop102:2888:3888
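The N in each server.N line must equal the number in that host's myid file under dataDir (written in section 7). A minimal sketch of that mapping, using a throwaway directory rather than the real zkData path:

```shell
# One myid file per host; ZooKeeper reads this single number at startup
# to know which server.N line in zoo.cfg refers to itself.
base=$(mktemp -d)
id=0
for host in hadoop100 hadoop101 hadoop102; do
    mkdir -p "$base/$host/zkData"
    echo "$id" > "$base/$host/zkData/myid"
    id=$((id + 1))
done
cat "$base/hadoop102/zkData/myid"   # 2
rm -rf "$base"
```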
6.3. Spark configuration
Spark on YARN mode:
echo 'export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh
7. File distribution
rsync -a $JAVA_HOME/ hadoop101:$JAVA_HOME/
rsync -a $JAVA_HOME/ hadoop102:$JAVA_HOME/
rsync -a $HADOOP_HOME/ hadoop101:$HADOOP_HOME/
rsync -a $HADOOP_HOME/ hadoop102:$HADOOP_HOME/
rsync -a $ZOOKEEPER_HOME/ hadoop101:$ZOOKEEPER_HOME/
rsync -a $ZOOKEEPER_HOME/ hadoop102:$ZOOKEEPER_HOME/
rsync -a $SPARK_HOME/ hadoop101:$SPARK_HOME/
rsync -a $SPARK_HOME/ hadoop102:$SPARK_HOME/
mkdir $ZOOKEEPER_HOME/zkData
ssh hadoop101 "mkdir $ZOOKEEPER_HOME/zkData"
ssh hadoop102 "mkdir $ZOOKEEPER_HOME/zkData"
echo 0 > $ZOOKEEPER_HOME/zkData/myid
ssh hadoop101 "echo 1 > $ZOOKEEPER_HOME/zkData/myid"
ssh hadoop102 "echo 2 > $ZOOKEEPER_HOME/zkData/myid"
8. First startup
1. Start ZooKeeper
zkServer.sh start
ssh hadoop101 'zkServer.sh start'
ssh hadoop102 'zkServer.sh start'
2. First-time Hadoop startup
# 2. Start the JournalNode (QJM) cluster
hdfs --daemon start journalnode
ssh hadoop101 'hdfs --daemon start journalnode'
ssh hadoop102 'hdfs --daemon start journalnode'
# 3. Format nn0
hdfs namenode -format
# 4. Start nn0
hdfs --daemon start namenode
# 5. Bootstrap nn1 from nn0
ssh hadoop101 'hdfs namenode -bootstrapStandby'
# 6. Start nn1
ssh hadoop101 'hdfs --daemon start namenode'
# 7. Initialize the HA state in ZooKeeper (run once for the whole cluster)
hdfs zkfc -formatZK
# 8. Start ZKFC on nn0 and nn1
hdfs --daemon start zkfc
ssh hadoop101 'hdfs --daemon start zkfc'
# 9. Start all DataNodes
hdfs --daemon start datanode
ssh hadoop101 'hdfs --daemon start datanode'
ssh hadoop102 'hdfs --daemon start datanode'
# 10. Start YARN
start-yarn.sh
3. Once steps 1–10 above have been run in order, the cluster can afterwards be started and stopped with the commands below (ZooKeeper still has to be running first):
start-dfs.sh
start-yarn.sh
stop-dfs.sh
stop-yarn.sh
4. Check cluster state
hdfs haadmin -getServiceState nn0
hdfs haadmin -getServiceState nn1
yarn rmadmin -getServiceState rm0
yarn rmadmin -getServiceState rm2
5. Test Spark
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar \
10
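SparkPi estimates π by Monte Carlo sampling: it throws random points into the unit square and counts those landing inside the quarter circle. A single-machine awk sketch of the same idea (an illustration only, not Spark code):

```shell
# 4 * (points inside quarter circle) / (total points) approximates pi.
awk 'BEGIN {
    srand(1); n = 100000; hits = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x*x + y*y <= 1) hits++
    }
    printf "Pi is roughly %f\n", 4 * hits / n
}'
```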
9. HIVE
1. Copy the MySQL JDBC driver into the lib directory of HIVE (and of Sqoop, if you use it)
cp mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/
2. Move aside HIVE's logging jar, which conflicts with Hadoop's SLF4J binding
cd $HIVE_HOME/lib
mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar.bak
3. Point HIVE's metastore at MySQL
vim $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Metastore in the MySQL database named hive: auto-create it, UTF-8 charset, no SSL.
     Inside XML the & separators must be escaped as &amp;. -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop100:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<!-- MySQL JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- MySQL username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
$HIVE_HOME/bin/schematool -initSchema -dbType mysql -verbose
4. Fix garbled non-ASCII text in the metastore (switch the comment columns to UTF-8)
mysql -uroot -p123456
USE hive;
ALTER TABLE COLUMNS_V2 MODIFY COLUMN `COMMENT` VARCHAR(256) CHARACTER SET utf8;
ALTER TABLE TABLE_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE PARTITION_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE INDEX_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
quit;
5. Test HIVE
hive -e 'CREATE DATABASE b1;'
hive -e 'CREATE TABLE b1.t(f STRING COMMENT "中")COMMENT "文";'
hive -e 'SHOW CREATE TABLE b1.t;'
hive -e 'INSERT INTO TABLE b1.t VALUES("漢語");'
hive -e 'SELECT * FROM b1.t;'
hive -e 'DROP DATABASE b1 CASCADE;'
hive -e 'SHOW DATABASES;'