Installing a Hadoop Pseudo-Distributed Cluster on CentOS 7

1. Get Hadoop

Downloading from a mirror inside China is much faster. Tsinghua mirror: Index of /apache (tsinghua.edu.cn)

Find the hadoop directory.

Click common.

Pick the version you need.

This example uses hadoop-3.2.2.
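For example, you can pull the tarball straight from the mirror with wget. A sketch; the exact path assumes the mirror still carries 3.2.2, since older releases get pruned from mirrors and are then only kept on archive.apache.org:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
# Fallback if the mirror has dropped this version:
# wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz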

2. Due to limited resources, we use a pseudo-distributed installation

IP: configure a static IP.

[root@bigdata01 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens33

Modify the following parameter:

BOOTPROTO="static"

Add:

IPADDR=192.168.184.128
GATEWAY=192.168.184.2
DNS1=192.168.184.2
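For reference, a minimal ifcfg-ens33 after the edits might look like the sketch below. The TYPE/NAME/DEVICE/ONBOOT lines come from the stock file generated by the installer; keep your own values for anything not mentioned above, and note ONBOOT must be "yes" for the interface to come up at boot:

TYPE="Ethernet"
BOOTPROTO="static"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.184.128
GATEWAY=192.168.184.2
DNS1=192.168.184.2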

Save the file, then restart the network service:

[root@192 ~]# service network restart
Restarting network (via systemctl): [ OK ]

3. Set a permanent hostname

[root@192 ~]# vi /etc/hostname
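Write the new hostname (bigdata01 in this example) into the file; the change takes effect after a reboot. On CentOS 7 you can also apply it immediately with hostnamectl:

hostnamectl set-hostname bigdata01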

4. Disable the firewall

Temporarily stop the firewall:

[root@192 ~]# systemctl stop firewalld

Check the firewall status:

[root@bigdata01 ~]# systemctl status firewalld

Permanently disable the firewall:

[root@bigdata01 ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
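To double-check that the service will stay off across reboots, you can ask systemd whether it is still enabled (it should print disabled):

systemctl is-enabled firewalld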

5. Set up passwordless SSH login

First, run the following command on the server. Here rsa is the encryption algorithm used for the key pair; after issuing the command, simply press Enter at each prompt to accept the defaults, without typing anything.

[root@bigdata01 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:J1ZsjlwvB9MZBUylhVPAjj6gQJgshpCs14zETqQmLhM root@bigdata01
The key's randomart image is:
+---[RSA 2048]----+
|==.o +=B= |
|++B . . .== |
|E* = Bo+. |
|=.+ + ..*.+. |
|oo . .So+ o |
|.. .. ooo |
| . |
| |
| |
+----[SHA256]-----+
After it finishes, the corresponding public and private key files have been generated under ~/.ssh:

[root@bigdata01 ~]# ll ~/.ssh
total 12
-rw-------. 1 root root 1679 Jul 3 17:42 id_rsa
-rw-r--r--. 1 root root 396 Jul 3 17:42 id_rsa.pub
-rw-r--r--. 1 root root 203 Jul 3 17:41 known_hosts

The .pub file above is the public key.

The next step is to copy the public key to every machine you want to log in to without a password.

Since this example has only one server, we append it to the authorized_keys file on this same machine:

[root@bigdata01 ~]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

>> appends the redirected output to the end of the file (a single > would overwrite it).

With that in place, you can now log in via ssh without a password:

[root@bigdata01 ~]# ssh bigdata01
Last login: Sat Jul 3 17:52:45 2021 from fe80::3c67:9a76:5a25:c4d9%ens33
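If ssh still asks for a password at this point, the usual culprit is file permissions: sshd ignores authorized_keys when the directory or file is too open. A quick fix:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys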

6. Install the JDK

Create the /data/soft directory to hold the installation files:

[root@bigdata01 /]# mkdir -p /data/soft

Upload the JDK archive to /data/soft:

[root@bigdata01 soft]# ll
total 190424
-rw-r--r--. 1 root root 194990602 Jul 3 18:04 jdk-8u211-linux-x64.tar.gz

Extract the JDK archive:

[root@bigdata01 soft]# tar -zxvf jdk-8u211-linux-x64.tar.gz

The extracted directory name is a bit long, so rename it:

[root@bigdata01 soft]# mv jdk1.8.0_211/ jdk1.8

Configure the JAVA_HOME environment variable:

[root@bigdata01 soft]# vi /etc/profile

.....

export JAVA_HOME=/data/soft/jdk1.8

export PATH=.:$JAVA_HOME/bin:$PATH

Reload the profile so the change takes effect:

[root@bigdata01 soft]# source /etc/profile

Verify the installation:
[root@bigdata01 soft]# java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

7. Install Hadoop

Upload the Hadoop archive to /data/soft.

Check the files:

[root@bigdata01 soft]# ll
total 576608
-rw-r--r--. 1 root root 395448622 Jul 3 18:22 hadoop-3.2.2.tar.gz
drwxr-xr-x. 7 10 143 245 Apr 2 2019 jdk1.8
-rw-r--r--. 1 root root 194990602 Jul 3 18:04 jdk-8u211-linux-x64.tar.gz

Extract the archive:

[root@bigdata01 soft]# tar -zxvf hadoop-3.2.2.tar.gz

Inspect the directory:

[root@bigdata01 soft]# ll
total 576608
drwxr-xr-x. 9 1000 1000 149 Jan 3 18:11 hadoop-3.2.2
-rw-r--r--. 1 root root 395448622 Jul 3 18:22 hadoop-3.2.2.tar.gz
drwxr-xr-x. 7 10 143 245 Apr 2 2019 jdk1.8
-rw-r--r--. 1 root root 194990602 Jul 3 18:04 jdk-8u211-linux-x64.tar.gz
[root@bigdata01 soft]# cd hadoop-3.2.2
[root@bigdata01 hadoop-3.2.2]# ll
total 184
drwxr-xr-x. 2 1000 1000 203 Jan 3 18:11 bin
drwxr-xr-x. 3 1000 1000 20 Jan 3 17:29 etc
drwxr-xr-x. 2 1000 1000 106 Jan 3 18:11 include
drwxr-xr-x. 3 1000 1000 20 Jan 3 18:11 lib
drwxr-xr-x. 4 1000 1000 4096 Jan 3 18:11 libexec
-rw-rw-r--. 1 1000 1000 150569 Dec 5 2020 LICENSE.txt
-rw-rw-r--. 1 1000 1000 21943 Dec 5 2020 NOTICE.txt
-rw-rw-r--. 1 1000 1000 1361 Dec 5 2020 README.txt
drwxr-xr-x. 3 1000 1000 4096 Jan 3 17:29 sbin
drwxr-xr-x. 4 1000 1000 31 Jan 3 18:46 share

Two directories under the hadoop directory matter most: bin and sbin.

First, look at bin:

[root@bigdata01 hadoop-3.2.2]# cd bin

[root@bigdata01 bin]# ll
total 1032
-rwxr-xr-x. 1 1000 1000 442728 Jan 3 17:54 container-executor
-rwxr-xr-x. 1 1000 1000 8707 Jan 3 17:28 hadoop
-rwxr-xr-x. 1 1000 1000 11265 Jan 3 17:28 hadoop.cmd
-rwxr-xr-x. 1 1000 1000 11274 Jan 3 17:32 hdfs
-rwxr-xr-x. 1 1000 1000 8081 Jan 3 17:32 hdfs.cmd
-rwxr-xr-x. 1 1000 1000 6237 Jan 3 17:57 mapred
-rwxr-xr-x. 1 1000 1000 6311 Jan 3 17:57 mapred.cmd
-rwxr-xr-x. 1 1000 1000 29184 Jan 3 17:54 oom-listener
-rwxr-xr-x. 1 1000 1000 485312 Jan 3 17:54 test-container-executor
-rwxr-xr-x. 1 1000 1000 12112 Jan 3 17:54 yarn
-rwxr-xr-x. 1 1000 1000 12840 Jan 3 17:54 yarn.cmd

It contains scripts such as hdfs and yarn, which you will later use to operate the HDFS and YARN components of the cluster.

Now look at sbin. It holds many scripts whose names start with start or stop; these are responsible for starting and stopping the cluster components.

[root@bigdata01 hadoop-3.2.2]# cd sbin
[root@bigdata01 sbin]# ll
total 108
-rwxr-xr-x. 1 1000 1000 2756 Jan 3 17:32 distribute-exclude.sh
drwxr-xr-x. 4 1000 1000 36 Jan 3 17:54 FederationStateStore
-rwxr-xr-x. 1 1000 1000 1983 Jan 3 17:28 hadoop-daemon.sh
-rwxr-xr-x. 1 1000 1000 2522 Jan 3 17:28 hadoop-daemons.sh
-rwxr-xr-x. 1 1000 1000 1542 Jan 3 17:33 httpfs.sh
-rwxr-xr-x. 1 1000 1000 1500 Jan 3 17:29 kms.sh
-rwxr-xr-x. 1 1000 1000 1841 Jan 3 17:57 mr-jobhistory-daemon.sh
-rwxr-xr-x. 1 1000 1000 2086 Jan 3 17:32 refresh-namenodes.sh
-rwxr-xr-x. 1 1000 1000 1779 Jan 3 17:28 start-all.cmd
-rwxr-xr-x. 1 1000 1000 2221 Jan 3 17:28 start-all.sh
-rwxr-xr-x. 1 1000 1000 1880 Jan 3 17:32 start-balancer.sh
-rwxr-xr-x. 1 1000 1000 1401 Jan 3 17:32 start-dfs.cmd
-rwxr-xr-x. 1 1000 1000 5170 Jan 3 17:32 start-dfs.sh
-rwxr-xr-x. 1 1000 1000 1793 Jan 3 17:32 start-secure-dns.sh
-rwxr-xr-x. 1 1000 1000 1571 Jan 3 17:54 start-yarn.cmd
-rwxr-xr-x. 1 1000 1000 3342 Jan 3 17:54 start-yarn.sh
-rwxr-xr-x. 1 1000 1000 1770 Jan 3 17:28 stop-all.cmd
-rwxr-xr-x. 1 1000 1000 2166 Jan 3 17:28 stop-all.sh
-rwxr-xr-x. 1 1000 1000 1783 Jan 3 17:32 stop-balancer.sh
-rwxr-xr-x. 1 1000 1000 1455 Jan 3 17:32 stop-dfs.cmd
-rwxr-xr-x. 1 1000 1000 3898 Jan 3 17:32 stop-dfs.sh
-rwxr-xr-x. 1 1000 1000 1756 Jan 3 17:32 stop-secure-dns.sh
-rwxr-xr-x. 1 1000 1000 1642 Jan 3 17:54 stop-yarn.cmd
-rwxr-xr-x. 1 1000 1000 3083 Jan 3 17:54 stop-yarn.sh
-rwxr-xr-x. 1 1000 1000 1982 Jan 3 17:28 workers.sh
-rwxr-xr-x. 1 1000 1000 1814 Jan 3 17:54 yarn-daemon.sh
-rwxr-xr-x. 1 1000 1000 2328 Jan 3 17:54 yarn-daemons.sh

Since we will be using scripts from both bin and sbin, add them to the PATH for convenience:

[root@bigdata01 sbin]# vi /etc/profile

.....

export JAVA_HOME=/data/soft/jdk1.8
export HADOOP_HOME=/data/soft/hadoop-3.2.2
export PATH=.:$JAVA_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH

[root@bigdata01 sbin]# source /etc/profile
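A quick way to confirm the new PATH works is to print the Hadoop version from any directory; it should report Hadoop 3.2.2 followed by build details:

hadoop version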

Modify the Hadoop configuration files

Enter the directory containing the configuration files:

[root@bigdata01 hadoop-3.2.2]# cd etc/hadoop/

We mainly need to modify the following files:

hadoop-env.sh

core-site.xml

hdfs-site.xml

mapred-site.xml

yarn-site.xml

workers

First modify hadoop-env.sh, adding the two environment variables below; appending them to the end of the file is fine.

[root@bigdata01 hadoop]# vi hadoop-env.sh

.......

export JAVA_HOME=/data/soft/jdk1.8

export HADOOP_LOG_DIR=/data/hadoop_repo/logs/hadoop

Modify the core-site.xml file.

Note that the hostname in the fs.defaultFS property must match the hostname you configured. The file should end up containing:

[root@bigdata01 hadoop]# vi core-site.xml

<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://bigdata01:9000</value>
 </property>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop_repo</value>
 </property>
</configuration>

Modify hdfs-site.xml and set the HDFS block replication factor to 1, since this pseudo-distributed cluster has only one node:

[root@bigdata01 hadoop]# vi hdfs-site.xml
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
</configuration>

Modify mapred-site.xml and set the resource scheduling framework that MapReduce uses:

[root@bigdata01 hadoop]# vi mapred-site.xml

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
</configuration>

Modify yarn-site.xml and set the auxiliary services YARN runs plus the environment variable whitelist:

[root@bigdata01 hadoop]# vi yarn-site.xml
<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>
 <property>
  <name>yarn.nodemanager.env-whitelist</name> 
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
 </property>
</configuration>

Modify workers to list the hostnames of the cluster's worker nodes; the start scripts ssh into each host listed here, which is why we set up passwordless SSH earlier. Since this cluster has only one node, just write bigdata01:

[root@bigdata01 hadoop]# vi workers
bigdata01

The configuration files are now done, but we cannot start the cluster just yet: HDFS is a distributed file system and must be formatted before first use, much like a new disk has to be formatted before an operating system can be installed on it.

Format HDFS:

[root@bigdata01 hadoop]# cd /data/soft/hadoop-3.2.2
[root@bigdata01 hadoop-3.2.2]# bin/hdfs namenode -format
WARNING: /data/hadoop_repo/logs/hadoop does not exist. Creating.
2021-07-03 20:52:59,441 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 192.168.184.128/192.168.184.128
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.2.2
STARTUP_MSG:   classpath = /data/soft/hadoop-3.2.2/etc/hadoop:/data/soft/hadoop-3.2.2/share/hadoop/common/lib/jetty-security-9.4.20.v20190813.jar:
.......
2021-07-03 20:53:01,097 INFO common.Storage: Storage directory /data/hadoop_repo/dfs/name has been successfully formatted.
2021-07-03 20:53:01,149 INFO namenode.FSImageFormatProtobuf: Saving image file /data/hadoop_repo/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-07-03 20:53:01,277 INFO namenode.FSImageFormatProtobuf: Image file /data/hadoop_repo/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 399 bytes saved in 0 seconds .
2021-07-03 20:53:01,291 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-07-03 20:53:01,301 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2021-07-03 20:53:01,302 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 192.168.184.128/192.168.184.128
************************************************************/

Seeing successfully formatted means the format worked. If an error is reported instead, it is almost always a configuration file problem; read the specific error message to track it down.

If a configuration error forces you to re-run the format, first delete everything under /data/hadoop_repo:

[root@bigdata01 data]# cd /data
[root@bigdata01 data]# ls
hadoop_repo  soft
[root@bigdata01 data]# rm -r hadoop_repo/

Then repeat the format step above.

Once the format has succeeded, do not format again: re-formatting generates a new NameNode clusterID that no longer matches the one the DataNode recorded, and the cluster will run into problems.
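If you suspect such a mismatch after an accidental re-format, you can compare the clusterID recorded on each side. With hadoop.tmp.dir set to /data/hadoop_repo as above, the files live at the paths below (the data directory only exists once a DataNode has run):

grep clusterID /data/hadoop_repo/dfs/name/current/VERSION
grep clusterID /data/hadoop_repo/dfs/data/current/VERSION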

Start the pseudo-distributed cluster

Run the start-all.sh script in the sbin directory:

[root@bigdata01 hadoop-3.2.2]# sbin/start-all.sh 
Starting namenodes on [bigdata01]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [bigdata01]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Starting resourcemanager
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.

The output is full of ERROR messages complaining that user accounts for the HDFS and YARN daemons are not defined.

The solution is as follows. Modify the start-dfs.sh and stop-dfs.sh scripts in the sbin directory, adding the content below near the top of each file.

In start-dfs.sh, add the following before the # Start hadoop dfs daemons line:

[root@bigdata01 sbin]# vi start-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

In stop-dfs.sh, add the following before the # Stop hadoop dfs daemons line:

[root@bigdata01 sbin]# vi stop-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Then modify start-yarn.sh and stop-yarn.sh in the sbin directory, again adding the content below near the top of each file.

In start-yarn.sh, add the following before the ## @description usage info line:

[root@bigdata01 sbin]# vi start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

In stop-yarn.sh, add the following before the ## @description usage info line:

[root@bigdata01 sbin]# vi stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
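As an alternative to patching all four scripts, the same variables can be exported once in etc/hadoop/hadoop-env.sh, which the start scripts also read. A sketch; note that running the daemons as root is only acceptable on a throwaway test box like this one:

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root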

Restart the pseudo-distributed cluster:

[root@bigdata01 sbin]# cd /data/soft/hadoop-3.2.2
[root@bigdata01 hadoop-3.2.2]# sbin/start-all.sh 
Starting namenodes on [bigdata01]
Last login: Sat Jul  3 17:53:46 CST 2021 from fe80::3c67:9a76:5a25:c4d9%ens33 on pts/3
Starting datanodes
Last login: Sat Jul  3 21:25:38 CST 2021 on pts/2
Starting secondary namenodes [bigdata01]
Last login: Sat Jul  3 21:25:40 CST 2021 on pts/2
Starting resourcemanager
Last login: Sat Jul  3 21:25:48 CST 2021 on pts/2
Starting nodemanagers
Last login: Sat Jul  3 21:25:58 CST 2021 on pts/2

Verify the cluster processes. The jps command lists the cluster's Java processes; apart from the Jps process itself, there should be 5 processes, which indicates the cluster started correctly:

[root@bigdata01 hadoop-3.2.2]# jps
2561 DataNode
2403 NameNode
3448 Jps
3115 NodeManager
2751 SecondaryNameNode
2991 ResourceManager
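As a further smoke test, you can issue a couple of commands against the live file system (the /test path is just an arbitrary example):

hdfs dfs -mkdir /test    # create a directory in HDFS
hdfs dfs -ls /           # list the HDFS root; /test should appear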

You can also verify that the cluster services are healthy through the web UIs:

HDFS web UI: http://192.168.184.128:9870

YARN web UI: http://192.168.184.128:8088
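If the pages do not load, check from the server that the daemons are actually listening on those ports (ss ships with CentOS 7's stock iproute package):

ss -lntp | grep -E '9870|8088'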

If you want to access them by hostname instead, you need to edit the hosts file on the Windows machine.

The file is located at C:\Windows\System32\drivers\etc\HOSTS. Add the line below, which is simply the Linux VM's IP and hostname; with this mapping in place, the Windows machine can reach the Linux VM by hostname.

192.168.184.128 bigdata01

Stop the cluster

If you change the cluster configuration files or need to stop the cluster for any other reason, use the following command:

[root@bigdata01 hadoop-3.2.2]# sbin/stop-all.sh 
Stopping namenodes on [bigdata01]
Last login: Tue Apr 7 17:59:40 CST 2020 on pts/0
Stopping datanodes
Last login: Tue Apr 7 18:06:09 CST 2020 on pts/0
Stopping secondary namenodes [bigdata01]
Last login: Tue Apr 7 18:06:10 CST 2020 on pts/0
Stopping nodemanagers
Last login: Tue Apr 7 18:06:13 CST 2020 on pts/0
Stopping resourcemanager
Last login: Tue Apr 7 18:06:16 CST 2020 on pts/0