Extracting Unstructured and Structured Data: XPath and the lxml Library

阿新 • Published: 2018-10-13


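What Is XML

XML stands for Extensible Markup Language.
XML is a markup language, much like HTML.
XML was designed to transport data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

W3School documentation: http://www.w3school.com.cn/xml/index.asp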

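Differences Between XML and HTML

Format	Description	Design goal
XML	Extensible Markup Language	Designed to transport and store data; the focus is on the content of the data.
HTML	HyperText Markup Language	Designed to display data; the focus is on how the data looks.
HTML DOM	Document Object Model for HTML	Gives access to every HTML element, together with the text and attributes it contains. Content can be modified or deleted, and new elements can be created.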

XML Document Example

<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

HTML DOM Model Example

The HTML DOM defines a standard way to access and manipulate HTML documents, and represents an HTML document as a tree.


XML Node Relationships

1. Parent

Every element and attribute has one parent.

In the following simple XML example, the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

2. Children

An element node can have zero, one, or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

3. Siblings

Nodes that share the same parent.

In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

4. Ancestors

A node's parent, its parent's parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

5. Descendants

A node's children, its children's children, and so on.

In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>
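
The five relationships above can be checked from Python. The following is a minimal sketch, assuming lxml is installed; it uses lxml's own navigation methods (getparent, itersiblings, iterancestors, iterdescendants) rather than XPath, and the variable names are only illustrative.

```python
from lxml import etree

# The same small document used in the ancestor/descendant examples.
# It is passed as bytes because it carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)        # the <bookstore> element
title = root.find("book/title")     # the <title> element

parent = title.getparent().tag                             # parent of <title>
children = [child.tag for child in title.getparent()]      # children of <book>
siblings = [el.tag for el in title.itersiblings()]         # siblings that follow <title>
ancestors = [el.tag for el in title.iterancestors()]       # nearest ancestor first
descendants = [el.tag for el in root.iterdescendants()]    # everything below <bookstore>

print(parent)       # book
print(children)     # ['title', 'author', 'year', 'price']
print(siblings)     # ['author', 'year', 'price']
print(ancestors)    # ['book', 'bookstore']
print(descendants)  # ['book', 'title', 'author', 'year', 'price']
```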

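What Is XPath?

XPath (XML Path Language) is a language for finding information in an XML document; it is used to navigate through the elements and attributes of the document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

XPath Development Tools

Open-source XPath expression editor: XMLQuire (for XML files)
Chrome extension: XPath Helper
Firefox add-on: XPath Checker

Selecting Nodes

XPath uses path expressions to select nodes or node-sets in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.

The most commonly used path expressions are listed below: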

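Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, regardless of their position in the document.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

The following table lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, wherever they sit beneath it.
//@lang	Selects all attributes named lang.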
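
These path expressions can be tried out with lxml. The sketch below is only illustrative: BOOKSTORE is an abbreviated copy of the sample document (author and year are left out, which is an assumption made for brevity), and the expressions are evaluated with the root bookstore element as the current node.

```python
from lxml import etree

# Abbreviated copy of the sample bookstore document (author/year omitted).
BOOKSTORE = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

doc = etree.fromstring(BOOKSTORE)   # current node: the root <bookstore> element

root_tag = doc.xpath("/bookstore")[0].tag       # absolute path to the root element
relative_books = doc.xpath("book")              # nodename: <book> children of the current node
no_match = doc.xpath("bookstore")               # the root has no child named bookstore
child_books = doc.xpath("/bookstore/book")      # book children of bookstore
all_books = doc.xpath("//book")                 # every book, anywhere in the document
langs = doc.xpath("//@lang")                    # attribute values come back as strings
title_parents = doc.xpath("//title/..")         # .. steps up to each title's parent

print(root_tag)                          # bookstore
print(len(relative_books), no_match)     # 4 []
print(len(child_books), len(all_books))  # 4 4
print(langs)                             # ['en', 'en', 'en', 'en']
print({el.tag for el in title_parents})  # {'book'}
```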

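Predicates

A predicate is used to find a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

The following table lists some path expressions with predicates, together with their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='en']	Selects all title elements whose lang attribute has the value en.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.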
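
The predicates in the table can be verified against the sample data. This is a sketch under the same assumptions as before: BOOKSTORE is an abbreviated copy of the sample document, and the helper name titles is hypothetical.

```python
from lxml import etree

# Abbreviated copy of the sample bookstore document (author/year omitted).
BOOKSTORE = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

doc = etree.fromstring(BOOKSTORE)


def titles(book_path):
    """Return the title text of every book selected by book_path (illustrative helper)."""
    return doc.xpath(book_path + "/title/text()")


first = titles("/bookstore/book[1]")                # XPath positions start at 1
last = titles("/bookstore/book[last()]")
second_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("/bookstore/book[price>35.00]")  # price text is compared as a number
with_lang = doc.xpath("//title[@lang]")             # titles that have a lang attribute
english = doc.xpath("//title[@lang='en']")          # titles whose lang equals 'en'

print(first)        # ['Everyday Italian']
print(last)         # ['Learning XML']
print(second_last)  # ['XQuery Kick Start']
print(first_two)    # ['Everyday Italian', 'Harry Potter']
print(expensive)    # ['XQuery Kick Start', 'Learning XML']
print(len(with_lang), len(english))  # 4 4
```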

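Selecting Unknown Nodes

XPath wildcards can be used to select XML nodes whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches any node of any kind.

The following table lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.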
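
A short sketch of the wildcards in use, again assuming lxml and an abbreviated copy of the sample document (BOOKSTORE is not the author's full listing):

```python
from lxml import etree

# Abbreviated copy of the sample bookstore document (author/year omitted).
BOOKSTORE = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

doc = etree.fromstring(BOOKSTORE)

child_tags = [el.tag for el in doc.xpath("/bookstore/*")]    # * matches any element
element_count = len(doc.xpath("//*"))                        # every element in the document
titles_with_attrs = doc.xpath("//title[@*]")                 # titles with at least one attribute
book_attr_values = doc.xpath("//book/@*")                    # @* matches any attribute
title_nodes = doc.xpath("/bookstore/book[1]/title/node()")   # node() also matches text nodes

print(child_tags)              # ['book', 'book', 'book', 'book']
print(element_count)           # 13 (1 bookstore + 4 book + 4 title + 4 price)
print(len(titles_with_attrs))  # 4
print(sorted(book_attr_values))
print(title_nodes)             # ['Everyday Italian']
```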

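Selecting Several Paths

The "|" operator in a path expression lets you select several paths at once.

Examples

The following table lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.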
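
As a quick check of the "|" operator, the sketch below runs the three expressions from the table with lxml. It assumes the same abbreviated copy of the sample document used for illustration; a node matched by both sides of a union is returned only once.

```python
from lxml import etree

# Abbreviated copy of the sample bookstore document (author/year omitted).
BOOKSTORE = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

doc = etree.fromstring(BOOKSTORE)

book_fields = doc.xpath("//book/title | //book/price")
all_fields = doc.xpath("//title | //price")
mixed = doc.xpath("/bookstore/book/title | //price")
overlap = doc.xpath("//title | //book/title")   # both sides match the same four nodes

print(len(book_fields), {el.tag for el in book_fields})  # 8, titles and prices
print(len(all_fields), len(mixed))                       # 8 8
print(len(overlap))                                      # 4 (no duplicates)
```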

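XPath Operators

The following operators can be used in XPath expressions:

Operator	Description	Example
|	Union of two node-sets	//book | //cd
+	Addition	6 + 4
-	Subtraction	6 - 4
*	Multiplication	6 * 4
div	Division	8 div 4
=	Equal to	price=9.80
!=	Not equal to	price!=9.80
<	Less than	price<9.80
<=	Less than or equal to	price<=9.80
>	Greater than	price>9.80
>=	Greater than or equal to	price>=9.80
or	Logical or	price=9.80 or price=9.70
and	Logical and	price>9.00 and price<9.90
mod	Modulus (remainder of a division)	5 mod 2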
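
XPath's arithmetic, comparison, and boolean operators can be exercised from Python as well. The following sketch assumes lxml and an abbreviated copy of the sample document; note that lxml returns numbers as float and boolean results as bool.

```python
from lxml import etree

# Abbreviated copy of the sample bookstore document (author/year omitted).
BOOKSTORE = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

doc = etree.fromstring(BOOKSTORE)

# Arithmetic operators: results are floats.
addition = doc.xpath("6 + 4")
division = doc.xpath("8 div 4")
remainder = doc.xpath("5 mod 2")
total_price = doc.xpath("sum(//price)")

# Comparison operators on their own return a bool.
has_cheap_book = doc.xpath("//price < 30")

# Comparison and boolean operators inside predicates filter nodes.
mid_range = doc.xpath("//book[price>=30 and price<40]/title/text()")
web_or_cheap = doc.xpath("//book[@category='web' or price<30]/title/text()")
not_web = doc.xpath("//book[@category!='web']/title/text()")

print(addition, division, remainder)  # 10.0 2.0 1.0
print(round(total_price, 2))          # 149.93
print(has_cheap_book)                 # True
print(mid_range)                      # ['Everyday Italian', 'Learning XML']
print(web_or_cheap)                   # ['Harry Potter', 'XQuery Kick Start', 'Learning XML']
print(not_web)                        # ['Everyday Italian', 'Harry Potter']
```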

That covers the XPath syntax. To use it when scraping with Python, the page must first be parsed into an XML document tree.

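The lxml Library

lxml is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.

Like the regular-expression engine, lxml is backed by C code (it wraps the libxml2 and libxslt C libraries), which makes it a high-performance HTML/XML parser for Python. With the XPath syntax covered above, it can quickly locate specific elements and node information.

lxml documentation: http://lxml.de/index.html

lxml depends on C libraries. It can be installed with pip: pip install lxml (or from a wheel).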

Getting Started

Here is a simple example that uses lxml to parse HTML:

# lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# etree.HTML() parses the string into an HTML document
html = etree.HTML(text)

# Serialize the HTML document; tostring() returns bytes
result = etree.tostring(html)

print(result.decode('utf-8'))

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

lxml repairs the HTML automatically: in this example it not only added the missing closing li tag, but also wrapped the fragment in html and body tags.
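
To see why this repair matters, compare lxml's strict XML parser with its HTML parser on the same malformed fragment. This is a small sketch with a made-up fragment (the variable names are illustrative): etree.fromstring() rejects the unclosed li, while etree.HTML() recovers from it.

```python
from lxml import etree

# A fragment whose second <li> is never closed.
broken = '<ul><li class="item-0">first item</li><li class="item-1">second item</ul>'

# The XML parser is strict and refuses malformed markup.
try:
    etree.fromstring(broken)
    xml_parsed = True
except etree.XMLSyntaxError:
    xml_parsed = False

# The HTML parser recovers: it closes the <li> and adds <html> and <body>.
html = etree.HTML(broken)
items = html.xpath("//li/text()")
serialized = etree.tostring(html, encoding="unicode")  # str instead of bytes

print(xml_parsed)  # False
print(items)       # ['first item', 'second item']
print(serialized)
```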

Reading from a file:

Besides parsing strings directly, lxml can also read content from a file. Create a new file named hello.html:

<!-- hello.html -->

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

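Then read the file with etree.parse(). By default etree.parse() uses a strict XML parser, so pass an HTML parser explicitly to get the same lenient handling as etree.HTML():

# lxml_parse.py

from lxml import etree

# Read the external file hello.html with an HTML parser
html = etree.parse('./hello.html', etree.HTMLParser())

# tostring() returns bytes, so decode before printing
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))

The structure of the output matches the earlier example, except that the span inside the third item is preserved. Because the whole document is serialized, lxml may also print a DOCTYPE line and the leading comment first:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>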

XPath Examples

1. Get all `<li>` tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)  # print the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

2. Get the `class` attribute of every `<li>` tag

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Output:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the `<a>` tag under `<li>` whose `href` is `link1.html`

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Output:

[<Element a at 0x10ffaae18>]

4. Get all `<span>` tags under `<li>` tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

# result = html.xpath('//li/span')
# The expression above is wrong: / selects child elements only,
# and <span> is not a direct child of <li>, so use a double slash instead.

result = html.xpath('//li//span')

print(result)

Output:

[<Element span at 0x10d698e18>]

5. Get every class attribute inside the `<a>` tags under `<li>` tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')

print(result)

Output:

['bold']

6. Get the href of the `<a>` in the last `<li>`

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last element

print(result)

Output:

['link5.html']

7. Get the content of the second-to-last element

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')

# The text attribute holds the element's text content
print(result[0].text)

Output:

fourth item

8. Get the tag name of the element whose `class` is `bold`

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//*[@class="bold"]')

# The tag attribute holds the element's tag name
print(result[0].tag)

Output:

span
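
Putting the pieces together, a typical extraction loop collects the href and the link text of every item. The sketch below is only one possible approach, assuming a shortened inline copy of hello.html (HELLO_HTML is a name chosen for illustration). It uses the XPath string(.) function because the text attribute is None when the text sits inside a nested element such as the span in the third item.

```python
from lxml import etree

# Shortened inline copy of hello.html, kept small for illustration.
HELLO_HTML = """
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
     </ul>
 </div>
"""

html = etree.HTML(HELLO_HTML)

links = []
for a in html.xpath("//li/a"):
    href = a.get("href")          # read a single attribute from the element
    text = a.xpath("string(.)")   # concatenated text of the element and its descendants
    links.append((href, text))

# The text attribute only covers text directly inside <a>, before any child element.
third_a = html.xpath('//li[@class="item-inactive"]/a')[0]

print(links)         # [('link1.html', 'first item'), ('link3.html', 'third item')]
print(third_a.text)  # None
```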
