Big Data with Hadoop (the HDFS File System)

阿新 • Published: 2019-01-09

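1 HDFS Overview

1.1 Concepts

HDFS is first of all a file system: it stores files and locates them through a directory tree. Second, it is distributed: many servers cooperate to provide its functionality, and each server in the cluster has its own role.

HDFS is designed for write-once, read-many workloads and does not support modifying a file in place (data can only be appended). This makes it a good fit for data analysis and a poor fit for network-drive applications.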

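1.2 Components

1) An HDFS cluster consists of a NameNode, DataNodes, and a Secondary NameNode.

2) The NameNode manages the metadata of the entire file system, including the block information for every path (file).

3) DataNodes manage the blocks of user file data. Each block can be stored as multiple replicas on multiple DataNodes.

4) The Secondary NameNode is an auxiliary background process that monitors the state of HDFS and periodically takes a checkpoint (snapshot) of the HDFS metadata. It is not a hot standby for the NameNode.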

1.3 HDFS Block Size

Files in HDFS are physically stored in blocks. The block size is set with the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.
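To make the block layout concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that computes how a file would be cut into blocks. The class name BlockSplitSketch, the method blockLengths, and the 197,657,687-byte example size (roughly 188.5 MB) are illustrative assumptions, not part of HDFS.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {
    /** Returns the length in bytes of each block a file of the given size would occupy. */
    static List<Long> blockLengths(long fileSize, long blockSize) {
        List<Long> lengths = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The last block holds only the remaining bytes.
            lengths.add(Math.min(blockSize, fileSize - offset));
        }
        return lengths;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // dfs.blocksize default in Hadoop 2.x
        long fileSize = 197_657_687L;        // assumed example size, about 188.5 MB
        System.out.println(blockLengths(fileSize, blockSize)); // prints [134217728, 63439959]
    }
}
```

With a 128 MB block size, a file of this size occupies one full block plus a second, smaller block.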

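HDFS blocks are larger than disk blocks in order to minimize seek overhead. If a block is large enough, the time to transfer its data from disk is significantly longer than the time to locate the start of the block. The time to transfer a file made up of multiple blocks then depends mainly on the disk transfer rate.

If the seek time is about 10 ms and the transfer rate is 100 MB/s, then keeping the seek time at only 1% of the transfer time requires a transfer time of about 1 s, so the block size should be about 100 MB. The default block size is 128 MB.

Block size: (10 ms / 1%) × 100 MB/s = 1 s × 100 MB/s = 100 MB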
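The same arithmetic can be written as a small Java sketch. The class and method names are hypothetical, and the inputs are the figures from the text (10 ms seek time, 100 MB/s transfer rate, 1% seek overhead).

```java
public class BlockSizeEstimate {
    /**
     * Block size in MB such that the seek time is the given fraction
     * of the time needed to transfer one block.
     */
    static double blockSizeMb(double seekMs, double rateMbPerSec, double seekFraction) {
        // Transfer time that makes the seek cost exactly seekFraction of it.
        double transferSeconds = (seekMs / 1000.0) / seekFraction;
        return transferSeconds * rateMbPerSec;
    }

    public static void main(String[] args) {
        System.out.println(blockSizeMb(10, 100, 0.01)); // prints 100.0
    }
}
```

A faster disk pushes the estimate up: at 200 MB/s the same 1% target gives about 200 MB per block.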

2 HDFS Command-Line Operations

1) Basic syntax

bin/hadoop fs <command>

2) Full list of options

bin/hadoop fs

[-appendToFile <localsrc> ... <dst>]

[-cat [-ignoreCrc] <src> ...]

[-checksum <src> ...]

[-chgrp [-R] GROUP PATH...]

[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]

[-chown [-R] [OWNER][:[GROUP]] PATH...]

[-copyFromLocal [-f] [-p] <localsrc> ... <dst>]

[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

[-count [-q] <path> ...]

[-cp [-f] [-p] <src> ... <dst>]

[-createSnapshot <snapshotDir> [<snapshotName>]]

[-deleteSnapshot <snapshotDir> <snapshotName>]

[-df [-h] [<path> ...]]

[-du [-s] [-h] <path> ...]

[-expunge]

[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

[-getfacl [-R] <path>]

[-getmerge [-nl] <src> <localdst>]

[-help [cmd ...]]

[-ls [-d] [-h] [-R] [<path> ...]]

[-mkdir [-p] <path> ...]

[-moveFromLocal <localsrc> ... <dst>]

[-moveToLocal <src> <localdst>]

[-mv <src> ... <dst>]

[-put [-f] [-p] <localsrc> ... <dst>]

[-renameSnapshot <snapshotDir> <oldName> <newName>]

[-rm [-f] [-r|-R] [-skipTrash] <src> ...]

[-rmdir [--ignore-fail-on-non-empty] <dir> ...]

[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]

[-setrep [-R] [-w] <rep> <path> ...]

[-stat [format] <path> ...]

[-tail [-f] <file>]

[-test -[defsz] <path>]

[-text [-ignoreCrc] <src> ...]

[-touchz <path> ...]

[-usage [cmd ...]]

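3) Common commands in practice

(1) -help: print the usage of a command

bin/hdfs dfs -help rm

(2) -ls: list directory contents

hadoop fs -ls /

(3) -mkdir: create a directory on HDFS

hadoop fs -mkdir -p /aaa/bbb/cc/dd

(4) -moveFromLocal: move (cut and paste) a local file to HDFS

hadoop fs -moveFromLocal /home/hadoop/a.txt /aaa/bbb/cc/dd

(5) -moveToLocal: move a file from HDFS to the local file system (not yet implemented)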

$ hadoop fs -help moveToLocal

-moveToLocal <src> <localdst> :

Not implemented yet

(6) -appendToFile: append a local file to the end of an existing HDFS file

hadoop fs -appendToFile ./hello.txt /hello.txt

(7) -cat: print the contents of a file

hadoop fs -cat /hello.txt

(8) -tail: print the end of a file

hadoop fs -tail /weblog/access_log.1

(9) -chgrp, -chmod, -chown: same usage as in the Linux file system; change a file's group, permissions, and owner

hadoop fs -chmod 666 /hello.txt

hadoop fs -chown someuser:somegrp /hello.txt

(10) -copyFromLocal: copy a file from the local file system to an HDFS path

hadoop fs -copyFromLocal ./jdk.tar.gz /aaa/

(11) -copyToLocal: copy from HDFS to the local file system

hadoop fs -copyToLocal /user/hello.txt ./hello.txt

(12) -cp: copy from one HDFS path to another HDFS path

hadoop fs -cp /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

(13) -mv: move a file within HDFS

hadoop fs -mv /aaa/jdk.tar.gz /

(14) -get: same as -copyToLocal; download a file from HDFS to the local file system

hadoop fs -get /user/hello.txt ./

(15) -getmerge: download several files merged into one; for example, the HDFS directory /aaa/ contains the files log.1, log.2, log.3, ...

hadoop fs -getmerge /aaa/log.* ./log.sum

(16) -put: same as -copyFromLocal

hadoop fs -put /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

(17) -rm: delete files, or directories when used with -r

hadoop fs -rm -r /aaa/bbb/

(18) -rmdir: delete an empty directory

hadoop fs -rmdir /aaa/bbb/ccc

(19) -df: report the free space of the file system

hadoop fs -df -h /

(20) -du: report the size of a directory

$ hadoop fs -du -s -h /user/jduser/wcinput

188.5M /user/jduser/wcinput

$ hadoop fs -du -h /user/jduser/wcinput

188.5M /user/jduser/wcinput/hadoop-2.7.2.tar.gz

97 /user/jduser/wcinput/wc.input

(21) -count: count the directories and files under a given path

hadoop fs -count /aaa/

$ hadoop fs -count /user/jduser/wcinput

1 2 197657784 /user/jduser/wcinput

The columns are the directory count, the file count, the total content size in bytes, and the path.

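(22) -setrep: set the replication factor of a file in HDFS

hadoop fs -setrep 3 /aaa/jdk.tar.gz

The replication factor set here is only recorded in the NameNode's metadata. Whether that many replicas actually exist depends on the number of DataNodes. With only 3 machines in the cluster, there can be at most 3 replicas; a replication factor of 10 is reached only once the cluster grows to 10 nodes.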
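The relationship between the requested and the actual number of replicas can be sketched in one line of Java. This is a simplified model that assumes every DataNode is healthy and holds at most one replica of a given block; the class and method names are hypothetical.

```java
public class ReplicaSketch {
    /** Replicas that can actually exist: never more than the number of live DataNodes. */
    static int effectiveReplicas(int requestedReplication, int liveDataNodes) {
        return Math.min(requestedReplication, liveDataNodes);
    }

    public static void main(String[] args) {
        System.out.println(effectiveReplicas(10, 3));  // prints 3
        System.out.println(effectiveReplicas(10, 10)); // prints 10
    }
}
```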

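3 HDFS Client Operations

3.1 Eclipse Environment Setup

3.1.1 Preparing the jar files

1) Extract hadoop-2.7.2.tar.gz to a directory whose path contains no Chinese characters.

2) Go into the share folder, find all jar files, and copy them to a _lib folder.

3) Among those jars, find the sources.jar files and move them to a _source folder.

4) Among those jars, find the tests.jar files and move them to a _test folder.

3.1.2 Preparing Eclipse

1) Configure the HADOOP_HOME environment variable.

2) Use the bin and lib folders from a compiled Hadoop build (restart Eclipse if the change does not take effect).

3) Create the first Java project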

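import java.net.URI; // needed only by the commented-out alternative below
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo1 {
    public static void main(String[] args) throws Exception {
        // 1 Get the file system
        Configuration configuration = new Configuration();
        // Point the client at the cluster
        configuration.set("fs.defaultFS", "hdfs://hadoop102:9000");
        FileSystem fileSystem = FileSystem.get(configuration);
        // Alternative: pass the cluster URI and the user name directly
        // FileSystem fileSystem = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

        // 2 Upload a local file to the file system
        fileSystem.copyFromLocalFile(new Path("f:/hello.txt"), new Path("/hello1.copy.txt"));

        // 3 Release resources
        fileSystem.close();
        System.out.println("over");
    }
}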

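4) Run the program

A user name must be configured when the program runs.

A client always operates on HDFS under some user identity. By default, the HDFS client API takes that identity from a JVM parameter: -DHADOOP_USER_NAME=jduser, where jduser is the user name.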
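The lookup order can be illustrated with a plain-Java sketch. It is a simplified model of what the Hadoop client does under simple (non-Kerberos) authentication, as far as I understand it: the HADOOP_USER_NAME environment variable is consulted first, then the JVM system property of the same name, and finally the operating-system user. The class and method names are hypothetical and this is not Hadoop's own code.

```java
public class HdfsUserSketch {
    /** Picks the first non-empty candidate: environment variable, JVM property, OS user. */
    static String resolveUser(String envValue, String sysPropValue, String osUser) {
        if (envValue != null && !envValue.isEmpty()) {
            return envValue;
        }
        if (sysPropValue != null && !sysPropValue.isEmpty()) {
            return sysPropValue;
        }
        return osUser;
    }

    public static void main(String[] args) {
        String user = resolveUser(
                System.getenv("HADOOP_USER_NAME"),
                System.getProperty("HADOOP_USER_NAME"), // set with -DHADOOP_USER_NAME=jduser
                System.getProperty("user.name"));
        System.out.println("HDFS operations would run as: " + user);
    }
}
```

In Eclipse the JVM parameter goes into the VM arguments of the run configuration. Passing the user name as the third argument of FileSystem.get, as the later examples do, avoids the need for it.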

3.2 Operating HDFS Through the API

3.2.1 Getting the HDFS file system

1) Code

@Test
public void initHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();

    // 2 Get the file system
    FileSystem fs = FileSystem.get(configuration);

    // 3 Print the file system
    System.out.println(fs.toString());
}

3.2.2 Uploading a file to HDFS

1) Code

@Test
public void putFileToHDFS() throws Exception {
    // 1 Create the configuration object
    // Configuration first loads the default files bundled in the jars (such as hdfs-default.xml),
    // then the site files on the classpath (such as hdfs-site.xml)
    Configuration configuration = new Configuration();

    // 2 Set parameters
    // Priority, highest first: 1. values set in client code
    // 2. user-defined configuration files on the classpath 3. the server's default configuration
    configuration.set("dfs.replication", "2");
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 3 Local path of the file to upload
    Path src = new Path("e:/hello.txt");

    // 4 Target path on HDFS
    Path dst = new Path("hdfs://hadoop102:9000/user/jduser/hello.txt");

    // 5 Copy the file
    fs.copyFromLocalFile(src, dst);
    fs.close();
}

2) Copy core-site.xml to the project root directory

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:9000</value>
    </property>
</configuration>

3) Copy hdfs-site.xml to the project root directory

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
    </property>
</configuration>

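4) Test the parameter priority

Parameter priority, from highest to lowest: (1) values set in client code, (2) user-defined configuration files on the classpath, (3) the server's default configuration.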
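The priority rule amounts to layering three sources and letting later layers override earlier ones. The sketch below models this with plain Java maps rather than Hadoop's Configuration class; the class name, the method name, and the site-file value "1" are hypothetical, while 3 is the stock default for dfs.replication.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigPrioritySketch {
    /** Resolves a key by applying defaults first, then the site file, then client code. */
    static String resolve(String key,
                          Map<String, String> defaults,
                          Map<String, String> siteFile,
                          Map<String, String> clientCode) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(siteFile);   // classpath configuration overrides the defaults
        merged.putAll(clientCode); // values set in code override everything else
        return merged.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("dfs.replication", "3"); // default configuration
        Map<String, String> siteFile = new HashMap<>();
        siteFile.put("dfs.replication", "1"); // hypothetical value in hdfs-site.xml
        Map<String, String> clientCode = new HashMap<>();
        clientCode.put("dfs.replication", "2"); // configuration.set(...) in client code

        System.out.println(resolve("dfs.replication", defaults, siteFile, clientCode)); // prints 2
    }
}
```

To check the rule against a real cluster, upload the same file with and without the configuration.set call and compare the replication factor shown for each upload.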

3.2.3 Downloading a file from HDFS

@Test
public void getFileFromHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // fs.copyToLocalFile(new Path("hdfs://hadoop102:9000/user/jduser/hello.txt"), new Path("d:/hello.txt"));
    // boolean delSrc: whether to delete the source file
    // Path src: path of the file to download
    // Path dst: local path to download the file to
    // boolean useRawLocalFileSystem: whether to use the raw local file system;
    //         true skips writing the local .crc checksum file

    // 2 Download the file
    fs.copyToLocalFile(false, new Path("hdfs://hadoop102:9000/user/jduser/hello.txt"), new Path("e:/hellocopy.txt"), true);
    fs.close();
}

3.2.4 Creating a directory on HDFS

@Test
public void mkdirAtHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Create the directory (missing parent directories are created as well)
    fs.mkdirs(new Path("hdfs://hadoop102:9000/user/jduser/output"));
    fs.close();
}

3.2.5 Deleting a directory on HDFS

@Test
public void deleteAtHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Delete the directory; the second argument controls recursion
    //   and must be true to delete a non-empty directory
    fs.delete(new Path("hdfs://hadoop102:9000/user/jduser/output"), true);
    fs.close();
}

3.2.6 Renaming a file on HDFS

@Test
public void renameAtHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Rename the file or directory
    fs.rename(new Path("hdfs://hadoop102:9000/user/jduser/hello.txt"), new Path("hdfs://hadoop102:9000/user/jduser/hellonihao.txt"));
    fs.close();
}

3.2.7 Viewing file details on HDFS

View the file name, permissions, length, and block information.

@Test
public void readListFiles() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // Question to consider: why does this return an iterator instead of a container such as List?
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);

    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        System.out.println(fileStatus.getPath().getName());
        System.out.println(fileStatus.getBlockSize());
        System.out.println(fileStatus.getPermission());
        System.out.println(fileStatus.getLen());

        // Print the offset and the hosts of every block of this file
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        for (BlockLocation bl : blockLocations) {
            System.out.println("block-offset:" + bl.getOffset());
            String[] hosts = bl.getHosts();
            for (String host : hosts) {
                System.out.println(host);
            }
        }
        System.out.println("-------------- divider --------------");
    }
    fs.close();
}
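One plausible answer to the question in the listing code above (why an iterator rather than a List): a directory tree can hold a very large number of files, and an iterator lets the client fetch entries from the server in batches instead of holding the whole listing in memory. The sketch below shows that idea in plain Java. It is not Hadoop's RemoteIterator; the class name and the fetchPage function are hypothetical stand-ins for a remote call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

public class BatchedListingSketch<T> implements Iterator<T> {
    private final IntFunction<List<T>> fetchPage; // hypothetical remote call: page number -> entries
    private List<T> page = Collections.emptyList();
    private int nextPageNo = 0;
    private int pos = 0;
    private boolean exhausted = false;

    public BatchedListingSketch(IntFunction<List<T>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only when the current one is used up.
        while (pos >= page.size() && !exhausted) {
            page = fetchPage.apply(nextPageNo++);
            pos = 0;
            if (page.isEmpty()) {
                exhausted = true; // an empty page marks the end of the listing
            }
        }
        return pos < page.size();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.get(pos++);
    }

    public static void main(String[] args) {
        // Hypothetical "server" with three pages, the last one empty.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("/a.txt", "/b.txt"),
                Arrays.asList("/c.txt"),
                Collections.<String>emptyList());
        Iterator<String> it = new BatchedListingSketch<String>(pages::get);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Only one page is held in memory at a time, and a caller that stops early never triggers the remaining fetches.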

3.2.8 Telling files from directories on HDFS

@Test
public void findAtHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Get the status of every entry under the queried path
    FileStatus[] listStatus = fs.listStatus(new Path("/"));

    // 3 Iterate over the entries: "f--" marks a file, "d--" a directory
    for (FileStatus status : listStatus) {
        if (status.isFile()) {
            System.out.println("f--" + status.getPath().getName());
        } else {
            System.out.println("d--" + status.getPath().getName());
        }
    }
    fs.close();
}

3.3 Operating HDFS Through I/O Streams

3.3.1 Uploading a file to HDFS

@Test
public void putFileToHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Create the input stream
    FileInputStream inStream = new FileInputStream(new File("e:/hello.txt"));

    // 3 Build the output path
    String putFileName = "hdfs://hadoop102:9000/user/jduser/hello1.txt";
    Path writePath = new Path(putFileName);

    // 4 Create the output stream
    FSDataOutputStream outStream = fs.create(writePath);

    // 5 Connect the streams
    try {
        IOUtils.copyBytes(inStream, outStream, 4096, false);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(inStream);
        IOUtils.closeStream(outStream);
    }
}
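For readers wondering what the IOUtils.copyBytes call in the upload code above does between the two streams, the loop below is a rough plain-Java equivalent. It is a sketch of the usual buffered-copy pattern, not Hadoop's implementation; the class name and the roundTrip helper are hypothetical and exist only to make the example runnable without a cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class CopyBytesSketch {
    /** Copies everything from in to out through a buffer; returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // write only the bytes actually read
            total += n;
        }
        return total;
    }

    /** Hypothetical helper: copies a byte array through in-memory streams. */
    static byte[] roundTrip(byte[] data, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(new ByteArrayInputStream(data), out, bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(new String(roundTrip("hello hdfs".getBytes(), 4))); // prints hello hdfs
    }
}
```

The important detail is the third argument of write: read may return fewer bytes than the buffer holds, so writing the whole buffer would corrupt the output.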

3.3.2 Downloading a file from HDFS

@Test
public void getFileToHDFS() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Path of the file to read
    String filename = "hdfs://hadoop102:9000/user/jduser/hello1.txt";

    // 3 Create the read Path
    Path readPath = new Path(filename);

    // 4 Create the input stream
    FSDataInputStream inStream = fs.open(readPath);

    // 5 Connect the streams and print to the console
    try {
        IOUtils.copyBytes(inStream, System.out, 4096, false);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(inStream);
    }
}

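3.3.3 Reading a file from a given position

1) Download the first block

@Test
// Download the contents of the first block
public void readFileSeek1() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Path of the file to read
    Path path = new Path("hdfs://hadoop102:9000/user/jduser/tmp/hadoop-2.7.2.tar.gz");

    // 3 Open the input stream
    FSDataInputStream fis = fs.open(path);

    // 4 Create the output stream
    FileOutputStream fos = new FileOutputStream("e:/hadoop-2.7.2.tar.gz.part1");

    // 5 Copy the first 128 MB (one block), writing only the bytes actually read
    byte[] buf = new byte[1024];
    long remaining = 128L * 1024 * 1024;
    while (remaining > 0) {
        int n = fis.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
            break; // end of file reached before a full block
        }
        fos.write(buf, 0, n);
        remaining -= n;
    }

    // 6 Close the streams
    IOUtils.closeStream(fis);
    IOUtils.closeStream(fos);
}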

2) Download the second block

@Test
// Download the contents of the second block
public void readFileSeek2() throws Exception {
    // 1 Create the configuration object
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "jduser");

    // 2 Path of the file to read
    Path path = new Path("hdfs://hadoop102:9000/user/jduser/tmp/hadoop-2.7.2.tar.gz");

    // 3 Open the input stream
    FSDataInputStream fis = fs.open(path);

    // 4 Create the output stream
    FileOutputStream fos = new FileOutputStream("e:/hadoop-2.7.2.tar.gz.part2");

    // 5 Seek to the offset where the second block starts (128 MB)
    fis.seek(1024 * 1024 * 128);

    // 6 Connect the streams (copies from the offset to the end of the file)
    IOUtils.copyBytes(fis, fos, 1024);

    // 7 Close the streams
    IOUtils.closeStream(fis);
    IOUtils.closeStream(fos);
}

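3) Merge the files

Run the following in a Windows command prompt. It appends part2 to the end of part1, so part1 then holds the complete archive:

type hadoop-2.7.2.tar.gz.part2 >> hadoop-2.7.2.tar.gz.part1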
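The same merge can be done from Java on any platform. The sketch below appends one file to the end of another with the standard java.nio.file API; the class name, the helper methods, and the temporary demo files are hypothetical and stand in for the two downloaded parts.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MergePartsSketch {
    /** Appends the bytes of part2 to the end of part1, like `type part2 >> part1`. */
    static void append(Path part2, Path part1) {
        try (OutputStream out = Files.newOutputStream(part1,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            Files.copy(part2, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds two small temporary files, merges them, and returns the merged content. */
    static String demo() {
        try {
            Path part1 = Files.createTempFile("demo", ".part1");
            Path part2 = Files.createTempFile("demo", ".part2");
            Files.write(part1, "first-block|".getBytes());
            Files.write(part2, "second-block".getBytes());
            append(part2, part1);
            String merged = new String(Files.readAllBytes(part1));
            Files.delete(part1);
            Files.delete(part2);
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints first-block|second-block
    }
}
```

After merging the real parts, part1 can be renamed back to hadoop-2.7.2.tar.gz and extracted to confirm that the two blocks together form the original archive.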
