Big Data in Practice (12): E-Commerce Data Warehouse (5), User Behavior Data Collection (5), Component Installation (1): Hadoop Installation
1) Cluster plan:

|      | Server hadoop102   | Server hadoop103             | Server hadoop104            |
| ---- | ------------------ | ---------------------------- | --------------------------- |
| HDFS | NameNode, DataNode | DataNode                     | DataNode, SecondaryNameNode |
| YARN | NodeManager        | ResourceManager, NodeManager | NodeManager                 |
Note: install the components offline whenever possible.
1.1 Project Experience: Multiple HDFS Storage Directories
If HDFS storage space is running low, expand the DataNodes with additional disks.
1) Add the new disks to the DataNode machines and mount them (a minimal sketch follows the configuration in step 2).
2) Configure the additional directories in hdfs-site.xml:
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
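A minimal sketch of step 1), run on each DataNode being expanded. The device names /dev/sdb, /dev/sdc, and /dev/sdd are assumptions (substitute whatever your new disks are called), and the atguigu user is assumed to be the one running the DataNode:

# Format the new disk and mount it at /hd2 (repeat for /dev/sdc -> /hd3, /dev/sdd -> /hd4)
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /hd2
sudo mount /dev/sdb /hd2
echo '/dev/sdb /hd2 ext4 defaults 0 0' | sudo tee -a /etc/fstab   # keep the mount across reboots

# Create the data directories referenced in hdfs-site.xml and give them to the DataNode user
sudo mkdir -p /hd2/dfs/data2 /hd3/dfs/data3 /hd4/dfs/data4
sudo chown -R atguigu:atguigu /hd2/dfs /hd3/dfs /hd4/dfs

# Restart the DataNode so the new dfs.datanode.data.dir value takes effect, then check capacity
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh start datanode
hdfs dfsadmin -report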
1.2 Project Experience: Configuring LZO Compression Support
1) Hadoop itself does not support LZO compression, so the open-source hadoop-lzo component from Twitter is used. hadoop-lzo depends on both Hadoop and lzo.
Enabling LZO support in Hadoop

0. Prepare the environment
   maven (download and install it, set the environment variables, and add the Aliyun mirror to settings.xml)
   gcc-c++, zlib-devel, autoconf, automake, libtool, installed via yum:
   yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool

1. Download, build, and install LZO
   wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
   tar -zxvf lzo-2.10.tar.gz
   cd lzo-2.10
   ./configure --prefix=/usr/local/hadoop/lzo/
   make
   make install

2. Build the hadoop-lzo source
   2.1 Download the hadoop-lzo source: https://github.com/twitter/hadoop-lzo/archive/master.zip
   2.2 After unpacking, edit pom.xml:
       <hadoop.current.version>2.7.2</hadoop.current.version>
   2.3 Declare two temporary environment variables:
       export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
       export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
   2.4 Build: enter hadoop-lzo-master and run the Maven build:
       mvn package -Dmaven.test.skip=true
   2.5 In target/, hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo component.
2) Put the built hadoop-lzo-0.4.20.jar into hadoop-2.7.2/share/hadoop/common/
[atguigu@hadoop102 common]$ pwd
/opt/module/hadoop-2.7.2/share/hadoop/common
[atguigu@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar
3) Sync hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104
[atguigu@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar
4) Add LZO compression support to core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
    </property>
    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>
5) Sync core-site.xml to hadoop103 and hadoop104
[atguigu@hadoop102 hadoop]$ xsync core-site.xml
6) Start and check the cluster
[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
7) Test
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output
8) Create an index for the LZO file
hadoop jar ./share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /output
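To sanity-check steps 7) and 8), list the job output directory; the file names below are illustrative of what the listing should contain, not output copied from the original:

[atguigu@hadoop102 hadoop-2.7.2]$ hadoop fs -ls /output
# Expected contents (illustrative):
#   /output/_SUCCESS
#   /output/part-r-00000.lzo          lzop-compressed wordcount output
#   /output/part-r-00000.lzo.index    index written by DistributedLzoIndexer so the
#                                     .lzo file can be split across map tasks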
1.3 Project Experience: Benchmarking
1) Test HDFS write performance
Test: write ten 128 MB files to the HDFS cluster
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB
19/05/02 11:44:26 INFO fs.TestDFSIO: TestDFSIO.1.8
19/05/02 11:44:26 INFO fs.TestDFSIO: nrFiles = 10
19/05/02 11:44:26 INFO fs.TestDFSIO: nrBytes (MB) = 128.0
19/05/02 11:44:26 INFO fs.TestDFSIO: bufferSize = 1000000
19/05/02 11:44:26 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
19/05/02 11:44:28 INFO fs.TestDFSIO: creating control file: 134217728 bytes, 10 files
19/05/02 11:44:30 INFO fs.TestDFSIO: created control files for: 10 files
19/05/02 11:44:30 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.1.103:8032
19/05/02 11:44:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.1.103:8032
19/05/02 11:44:32 INFO mapred.FileInputFormat: Total input paths to process : 10
19/05/02 11:44:32 INFO mapreduce.JobSubmitter: number of splits:10
19/05/02 11:44:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556766549220_0003
19/05/02 11:44:34 INFO impl.YarnClientImpl: Submitted application application_1556766549220_0003
19/05/02 11:44:34 INFO mapreduce.Job: The url to track the job: http://hadoop103:8088/proxy/application_1556766549220_0003/
19/05/02 11:44:34 INFO mapreduce.Job: Running job: job_1556766549220_0003
19/05/02 11:44:47 INFO mapreduce.Job: Job job_1556766549220_0003 running in uber mode : false
19/05/02 11:44:47 INFO mapreduce.Job: map 0% reduce 0%
19/05/02 11:45:05 INFO mapreduce.Job: map 13% reduce 0%
19/05/02 11:45:06 INFO mapreduce.Job: map 27% reduce 0%
19/05/02 11:45:08 INFO mapreduce.Job: map 43% reduce 0%
19/05/02 11:45:09 INFO mapreduce.Job: map 60% reduce 0%
19/05/02 11:45:10 INFO mapreduce.Job: map 73% reduce 0%
19/05/02 11:45:15 INFO mapreduce.Job: map 77% reduce 0%
19/05/02 11:45:18 INFO mapreduce.Job: map 87% reduce 0%
19/05/02 11:45:19 INFO mapreduce.Job: map 100% reduce 0%
19/05/02 11:45:21 INFO mapreduce.Job: map 100% reduce 100%
19/05/02 11:45:22 INFO mapreduce.Job: Job job_1556766549220_0003 completed successfully
19/05/02 11:45:22 INFO mapreduce.Job: Counters: 51
    File System Counters
        FILE: Number of bytes read=856
        FILE: Number of bytes written=1304826
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2350
        HDFS: Number of bytes written=1342177359
        HDFS: Number of read operations=43
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=12
    Job Counters
        Killed map tasks=1
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=8
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=263635
        Total time spent by all reduces in occupied slots (ms)=9698
        Total time spent by all map tasks (ms)=263635
        Total time spent by all reduce tasks (ms)=9698
        Total vcore-milliseconds taken by all map tasks=263635
        Total vcore-milliseconds taken by all reduce tasks=9698
        Total megabyte-milliseconds taken by all map tasks=269962240
        Total megabyte-milliseconds taken by all reduce tasks=9930752
    Map-Reduce Framework
        Map input records=10
        Map output records=50
        Map output bytes=750
        Map output materialized bytes=910
        Input split bytes=1230
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=910
        Reduce input records=50
        Reduce output records=5
        Spilled Records=100
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=17343
        CPU time spent (ms)=96930
        Physical memory (bytes) snapshot=2821341184
        Virtual memory (bytes) snapshot=23273218048
        Total committed heap usage (bytes)=2075656192
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1120
    File Output Format Counters
        Bytes Written=79
19/05/02 11:45:23 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
19/05/02 11:45:23 INFO fs.TestDFSIO: Date & time: Thu May 02 11:45:23 CST 2019
19/05/02 11:45:23 INFO fs.TestDFSIO: Number of files: 10
19/05/02 11:45:23 INFO fs.TestDFSIO: Total MBytes processed: 1280.0
19/05/02 11:45:23 INFO fs.TestDFSIO: Throughput mb/sec: 10.69751115716984
19/05/02 11:45:23 INFO fs.TestDFSIO: Average IO rate mb/sec: 14.91699504852295
19/05/02 11:45:23 INFO fs.TestDFSIO: IO rate std deviation: 11.160882132355928
19/05/02 11:45:23 INFO fs.TestDFSIO: Test exec time sec: 52.315
2) Test HDFS read performance
Test: read ten 128 MB files from the HDFS cluster
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB
19/05/02 11:55:42 INFO fs.TestDFSIO: TestDFSIO.1.8
19/05/02 11:55:42 INFO fs.TestDFSIO: nrFiles = 10
19/05/02 11:55:42 INFO fs.TestDFSIO: nrBytes (MB) = 128.0
19/05/02 11:55:42 INFO fs.TestDFSIO: bufferSize = 1000000
19/05/02 11:55:42 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
19/05/02 11:55:45 INFO fs.TestDFSIO: creating control file: 134217728 bytes, 10 files
19/05/02 11:55:47 INFO fs.TestDFSIO: created control files for: 10 files
19/05/02 11:55:47 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.1.103:8032
19/05/02 11:55:48 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/192.168.1.103:8032
19/05/02 11:55:49 INFO mapred.FileInputFormat: Total input paths to process : 10
19/05/02 11:55:49 INFO mapreduce.JobSubmitter: number of splits:10
19/05/02 11:55:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556766549220_0004
19/05/02 11:55:50 INFO impl.YarnClientImpl: Submitted application application_1556766549220_0004
19/05/02 11:55:50 INFO mapreduce.Job: The url to track the job: http://hadoop103:8088/proxy/application_1556766549220_0004/
19/05/02 11:55:50 INFO mapreduce.Job: Running job: job_1556766549220_0004
19/05/02 11:56:04 INFO mapreduce.Job: Job job_1556766549220_0004 running in uber mode : false
19/05/02 11:56:04 INFO mapreduce.Job: map 0% reduce 0%
19/05/02 11:56:24 INFO mapreduce.Job: map 7% reduce 0%
19/05/02 11:56:27 INFO mapreduce.Job: map 23% reduce 0%
19/05/02 11:56:28 INFO mapreduce.Job: map 63% reduce 0%
19/05/02 11:56:29 INFO mapreduce.Job: map 73% reduce 0%
19/05/02 11:56:30 INFO mapreduce.Job: map 77% reduce 0%
19/05/02 11:56:31 INFO mapreduce.Job: map 87% reduce 0%
19/05/02 11:56:32 INFO mapreduce.Job: map 100% reduce 0%
19/05/02 11:56:35 INFO mapreduce.Job: map 100% reduce 100%
19/05/02 11:56:36 INFO mapreduce.Job: Job job_1556766549220_0004 completed successfully
19/05/02 11:56:36 INFO mapreduce.Job: Counters: 51
    File System Counters
        FILE: Number of bytes read=852
        FILE: Number of bytes written=1304796
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1342179630
        HDFS: Number of bytes written=78
        HDFS: Number of read operations=53
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Killed map tasks=1
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=8
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=233690
        Total time spent by all reduces in occupied slots (ms)=7215
        Total time spent by all map tasks (ms)=233690
        Total time spent by all reduce tasks (ms)=7215
        Total vcore-milliseconds taken by all map tasks=233690
        Total vcore-milliseconds taken by all reduce tasks=7215
        Total megabyte-milliseconds taken by all map tasks=239298560
        Total megabyte-milliseconds taken by all reduce tasks=7388160
    Map-Reduce Framework
        Map input records=10
        Map output records=50
        Map output bytes=746
        Map output materialized bytes=906
        Input split bytes=1230
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=906
        Reduce input records=50
        Reduce output records=5
        Spilled Records=100
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=6473
        CPU time spent (ms)=57610
        Physical memory (bytes) snapshot=2841436160
        Virtual memory (bytes) snapshot=23226683392
        Total committed heap usage (bytes)=2070413312
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1120
    File Output Format Counters
        Bytes Written=78
19/05/02 11:56:36 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
19/05/02 11:56:36 INFO fs.TestDFSIO: Date & time: Thu May 02 11:56:36 CST 2019
19/05/02 11:56:36 INFO fs.TestDFSIO: Number of files: 10
19/05/02 11:56:36 INFO fs.TestDFSIO: Total MBytes processed: 1280.0
19/05/02 11:56:36 INFO fs.TestDFSIO: Throughput mb/sec: 16.001000062503905
19/05/02 11:56:36 INFO fs.TestDFSIO: Average IO rate mb/sec: 17.202795028686523
19/05/02 11:56:36 INFO fs.TestDFSIO: IO rate std deviation: 4.881590515873911
19/05/02 11:56:36 INFO fs.TestDFSIO: Test exec time sec: 49.116
3) Delete the data generated by the tests
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2-tests.jar TestDFSIO -clean
4) Evaluate MapReduce with the Sort program
(1) Use RandomWriter to generate random data: each node runs 10 map tasks, and each map produces roughly 1 GB of random binary data
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomwriter random-data
(2) Run the Sort program
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar sort random-data sorted-data
(3) Verify that the data is actually sorted
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar testmapredsort -sortInput random-data -sortOutput sorted-data
1.4 Project Experience: Hadoop Parameter Tuning
1) HDFS tuning in hdfs-site.xml
(1) dfs.namenode.handler.count = 20 * log2(cluster size); for example, for an 8-node cluster set it to 60.
The number of Namenode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured then Namenode RPC server threads listen to requests from all nodes. The NameNode maintains a pool of worker threads that handles concurrent heartbeats from the DataNodes as well as concurrent metadata operations from clients. For large clusters, or clusters with many clients, the default value of 10 for dfs.namenode.handler.count usually needs to be raised. The usual rule of thumb is 20 * log2(N), where N is the cluster size, which gives the value of 60 for 8 nodes mentioned above (a quick arithmetic sketch follows item (2) below).
(2) Keep the edit log directory (dfs.namenode.edits.dir) on separate storage from the fsimage directory (dfs.namenode.name.dir) wherever possible, to minimize write latency.
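A quick arithmetic check of the 20 * log2(N) rule of thumb from item (1); bc's l() is the natural log, so dividing by l(2) gives log base 2. The resulting value is what goes into dfs.namenode.handler.count in hdfs-site.xml:

echo "20 * l(8) / l(2)" | bc -l    # 8-node cluster -> ~60
echo "20 * l(3) / l(2)" | bc -l    # 3-node cluster (hadoop102-104) -> ~32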
2) YARN tuning in yarn-site.xml
(1) Scenario: 7 machines in total, several hundred million records per day, data flow: source -> Flume -> Kafka -> HDFS -> Hive.
Problem: the statistics are mostly written in HiveQL; there is no data skew, small files have already been merged, JVM reuse is enabled, I/O is not a bottleneck, and memory usage is below 50%. Yet jobs still run very slowly, and when a data peak arrives the whole cluster goes down. Is there a way to tune for this situation?
(2) Solution (a yarn-site.xml sketch follows the two parameters below):
Memory utilization is too low. This usually comes down to two YARN settings: the maximum memory a single task may request, and the memory available to YARN on a single node. Tuning these two parameters raises overall memory utilization.
(a) yarn.nodemanager.resource.memory-mb
The total amount of physical memory on the node that YARN may use; the default is 8192 (MB). Note that if the node has less than 8 GB of memory, you must lower this value yourself; YARN does not detect the node's physical memory automatically.
(b) yarn.scheduler.maximum-allocation-mb
The maximum amount of physical memory a single task may request; the default is 8192 (MB).
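A minimal yarn-site.xml sketch for these two parameters. The 4096 MB values are an assumption for nodes with 4 GB of RAM, not a recommendation from the original; size them to your own hardware:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
</property>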
3) Hadoop crashes
(1) If MapReduce jobs bring the system down: limit the number of tasks YARN runs concurrently and the maximum memory each task may request. Parameter to adjust: yarn.scheduler.maximum-allocation-mb (the maximum physical memory a single task may request; default 8192 MB).
(2) If excessive file writes bring the NameNode down: increase Kafka's storage capacity and throttle the rate at which data is written from Kafka to HDFS. Let Kafka buffer the data during the peak; once the peak has passed, the data sync catches up on its own.