Working with the Hadoop Distributed File System (HDFS) from Python using the hdfs package
=====================================================================================
A few words up front:
The Hadoop cluster I built earlier combines Hive, HBase, Sqoop, Spark and other open-source tools, and now I want to put a web-based UI on top of them. Since I only know how to build interactive web applications with Python, I went with Python's Django framework.
For a Django tutorial, see: Django from manage.py shell to project deployment
For the Hadoop cluster setup, see: Deploying a highly available Hadoop cluster on three PC servers
Now, to the point:
Operating HDFS from Python is not hard in itself; it is mostly a matter of "translating" the corresponding shell commands into a high-level language. The packages you will most often see mentioned online are:
pyhdfs: official documentation
hdfs: official documentation
libhdfs (rather painful to use)
I chose hdfs, and all the examples below are based on the hdfs package.
1: Installation
My environment is Windows (Linux works the same way); with pip or setup_install the installation is straightforward:

pip install hdfs
2: Client — create a connection to the cluster

>>> from hdfs import *
>>> client = Client("http://127.0.0.1:50070")
The other parameters:

class hdfs.client.Client(url, root=None, proxy=None, timeout=None, session=None)

url: ip:port of the NameNode's WebHDFS interface
root: the HDFS directory to use as the root
proxy: the user identity to impersonate when connecting
timeout: request timeout
session: a requests.Session instance, used to issue all of the client's requests
Let's take a closer look at proxy. First, connect as the root user:

>>> client = Client("http://127.0.0.1:50070", root="/", timeout=100, session=False)
>>> client.list("/")
[u'hbase']
>>> client = Client("http://127.0.0.1:50070", root="/", proxy="gamer", timeout=100, session=False)
>>> client.list("/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 893, in list
    statuses = self._list_status(hdfs_path).json()['FileStatuses']['FileStatus']
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 92, in api_handler
    **self.kwargs
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 181, in _request
    return _on_error(response)
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 44, in _on_error
    raise HdfsError(message)
hdfs.util.HdfsError: Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: dr.who is not allowed to impersonate gamer
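The error comes from WebHDFS impersonation: under simple authentication the NameNode sees the caller as dr.who, and that account is not configured as a proxy user, so it is not allowed to act as gamer. What the proxy parameter does at the REST level can be sketched with the standard library alone (a sketch, assuming the default WebHDFS API prefix /webhdfs/v1; no cluster or hdfs package needed):

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, user=None, doas=None):
    # Sketch of the REST URL a WebHDFS client issues under the hood.
    # user.name authenticates the caller (simple auth); doas asks to
    # impersonate another user, which the NameNode only permits for
    # accounts listed under hadoop.proxyuser.* in core-site.xml.
    params = {'op': op}
    if user:
        params['user.name'] = user
    if doas:
        params['doas'] = doas
    return 'http://%s/webhdfs/v1%s?%s' % (host, path, urlencode(params))

# The failing call above corresponds to a URL along these lines:
print(webhdfs_url('127.0.0.1:50070', '/', 'LISTSTATUS', doas='gamer'))
```

To make the call succeed, either add a hadoop.proxyuser.&lt;caller&gt;.hosts / .groups entry for the calling account in core-site.xml, or skip impersonation and authenticate directly as the target user (the hdfs package ships an InsecureClient(url, user=...) wrapper for exactly that).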
3: dir — list the available methods

>>> dir(client)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__',
'__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__registry__',
'__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_append', '_create', '_delete',
'_get_content_summary', '_get_file_checksum', '_get_file_status', '_get_home_directory', '_list_status', '_mkdirs', '_open',
'_proxy', '_rename', '_request', '_session', '_set_owner', '_set_permission', '_set_replication', '_set_times', '_timeout',
'checksum', 'content', 'delete', 'download', 'from_options', 'list', 'makedirs', 'parts', 'read', 'rename', 'resolve', 'root',
'set_owner', 'set_permission', 'set_replication', 'set_times', 'status', 'upload',
'url', 'walk', 'write']
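Most of that output is Python dunder machinery. Filtering out names that start with an underscore leaves just the public API (a throwaway snippet; the stub class below stands in for a real Client so it runs without a cluster):

```python
class StubClient:
    # Minimal stand-in for hdfs.client.Client, just to demo the filter.
    def list(self, hdfs_path): pass
    def status(self, hdfs_path): pass
    def _request(self): pass

# dir() returns a sorted list of names; keep only the public ones.
public = [m for m in dir(StubClient()) if not m.startswith('_')]
print(public)  # ['list', 'status']
```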
4: status — get details about a path

>>> client.status("/")
{'accessTime': 0, 'pathSuffix': '', 'group': 'supergroup', 'type': 'DIRECTORY', 'owner': 'root', 'childrenNum': 4, 'blockSize': 0,
'fileId': 16385, 'length': 0, 'replication': 0, 'storagePolicy': 0, 'modificationTime': 1473023149031, 'permission': '777'}
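One detail worth noting: accessTime and modificationTime are epoch timestamps in milliseconds, not seconds. A small conversion sketch (the dict literal just copies values from the output above):

```python
import datetime

st = {'modificationTime': 1473023149031, 'type': 'DIRECTORY', 'permission': '777'}

# Divide by 1000.0: WebHDFS reports times in milliseconds since the epoch.
mtime = datetime.datetime.fromtimestamp(
    st['modificationTime'] / 1000.0, tz=datetime.timezone.utc)
print(st['type'], st['permission'], mtime.isoformat())
```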
The other parameters:

status(hdfs_path, strict=True)

hdfs_path: the HDFS path
strict: when True, a nonexistent hdfs_path raises an exception; when False, a nonexistent path returns None
>>> client = Client("http://127.0.0.1:50070", root="/", timeout=100, session=False)
>>> client.status("/gamer", strict=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 277, in status
    res = self._get_file_status(hdfs_path, strict=strict)
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 92, in api_handler
    **self.kwargs
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 181, in _request
    return _on_error(response)
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 44, in _on_error
    raise HdfsError(message)
hdfs.util.HdfsError: File does not exist: /gamer
>>> client.status("/gamer", strict=False)
>>>
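strict=False makes a handy building block for an existence check that never raises. A sketch (the FakeClient stub only mimics the two outcomes shown above so the snippet runs without a cluster; swap in a real Client in practice):

```python
def path_exists(client, hdfs_path):
    # status() returns None instead of raising when strict=False.
    return client.status(hdfs_path, strict=False) is not None

class FakeClient:
    # Stand-in for hdfs.client.Client: /hbase exists, everything else doesn't.
    def status(self, hdfs_path, strict=True):
        if hdfs_path == '/hbase':
            return {'type': 'DIRECTORY'}
        if strict:
            raise IOError('File does not exist: %s' % hdfs_path)
        return None

client = FakeClient()
print(path_exists(client, '/hbase'))  # True
print(path_exists(client, '/gamer'))  # False
```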
5: list — list the children of a path

>>> client.list("/")
['file', 'gyt', 'hbase', 'tmp']

The other parameters:

list(hdfs_path, status=False)
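With status=True, list() pairs each child name with its status dictionary (the same shape that status() returns). A sketch of the two return shapes, again using an offline stub in place of a real Client (the canned entries are made up for illustration):

```python
class FakeClient:
    # Stand-in for hdfs.client.Client.list with canned directory contents.
    _entries = {
        'hbase': {'type': 'DIRECTORY', 'permission': '755'},
        'tmp': {'type': 'DIRECTORY', 'permission': '777'},
    }

    def list(self, hdfs_path, status=False):
        names = sorted(self._entries)
        if status:
            # status=True yields (name, status_dict) tuples.
            return [(n, self._entries[n]) for n in names]
        return names

client = FakeClient()
print(client.list('/'))  # ['hbase', 'tmp']
print(client.list('/', status=True))
```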