Parsers for the BS4 Library

阿新 • Published: 2019-01-02

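The bs4 library can locate the elements we want quickly because it first parses the whole HTML file. Different parsers give different results, and this article goes through them.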

Choosing a bs4 parser

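The ultimate goal of a web crawler is to filter and select information from the web, and the parser is arguably the most important part: its quality determines the crawler's speed and efficiency. Besides the 'html.parser' parser used in the previous article, bs4 also supports several third-party parsers. Let's compare them.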

The bs4 documentation recommends the lxml parser because it is faster, so we will use lxml here as well.
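
To make the difference between parsers concrete, here is a minimal sketch that feeds the same piece of broken markup to several parsers and collects what each one builds. The markup string and the `results` dictionary are illustrative names, not part of the original article, and any third-party parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: an unclosed <a> and a stray </p>.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This third-party parser is not installed; record that and move on.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Each parser repairs the broken markup in its own way, so the printed trees can differ. That is worth keeping in mind when a script is moved to a machine with a different set of parsers installed.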


Installing the lxml parser:

As before, install it with pip:

$ pip install lxml

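Note: I am on a Unix-like system, where pip works smoothly. On Windows the installation often runs into one problem or another, so Windows users are advised to go to the official lxml site and download an installation package that matches their system version.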
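
If lxml turns out to be hard to install on a given machine, a script does not have to fail outright. The sketch below uses a hypothetical helper, `make_soup` (my own name, not a bs4 function), which tries lxml first and falls back to the built-in 'html.parser'. It relies on the `FeatureNotFound` exception that bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is missing; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.string)
```

The trade-off is that the two parsers may build slightly different trees for malformed pages, so a fallback like this suits well-formed input best.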

Parsing a web page with the lxml parser

We again use the Alice document from the previous article as the example:

html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 """

Let's try it:

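import bs4
# First, parse the HTML file with lxml to make the soup.
# The with block closes the file once parsing is done.
with open('Beautiful Soup 爬蟲/demo.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
# Print the result: a clearly laid out tree structure.
#print(soup.prettify())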
'''
OUT:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''
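
BeautifulSoup accepts either a string or an open file handle, so the same document can be parsed both ways. The sketch below is an assumption-laden illustration: it uses a shortened copy of the Alice markup, a temporary file that stands in for demo.html, and the built-in 'html.parser' so it runs without lxml (swap in 'lxml' if it is installed).

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Shortened stand-in for the Alice document.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

# Parse directly from a string.
soup_from_string = BeautifulSoup(html_doc, "html.parser")

# Parse from an open file handle (a temporary file plays the role of demo.html).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_doc)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.title.string)
```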

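How do you actually use it?

bs4 first converts the string or file handle passed in to Unicode, so scraping Chinese text does not lead to painful encoding problems. For less common encodings such as 'big5', however, the encoding may have to be set by hand: soup = BeautifulSoup(markup, from_encoding="encoding name")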
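
As a small illustration of from_encoding, the sketch below encodes a short Chinese paragraph as big5 bytes and tells bs4 how to decode it. The sample text is made up for the example, and 'html.parser' is used only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a big5-encoded page (sample text is made up).
markup = "<p>中文資訊</p>".encode("big5")

# Tell bs4 the encoding explicitly instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="big5")

print(soup.p.string)            # the text, now an ordinary Unicode string
print(soup.original_encoding)   # the encoding bs4 used to decode the bytes
```

Note that from_encoding only matters when the markup is passed as bytes. A Python str is already Unicode, so there is nothing left to decode.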

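Kinds of objects:

bs4 turns a complex HTML document into a complex tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment. Let's go through them one by one: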

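Tag: essentially the same as a tag in HTML, and easy to pick up.

NavigableString: the string wrapped inside a tag.

BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object, and it supports the methods for navigating and searching the document tree.

Comment: a special kind of NavigableString that holds an HTML comment. It is written out with special formatting when it appears in the HTML document.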
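
The four types are easy to tell apart in practice. Here is a minimal sketch, using a one-line document invented for the purpose, that pulls out one object of each kind and checks its type.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<b class='x'>bold<!--a note--></b>", "html.parser")

tag = soup.b                # Tag: the <b> element
text = tag.contents[0]      # NavigableString: the text inside the tag
comment = tag.contents[1]   # Comment: the HTML comment inside the tag

print(type(soup).__name__)  # the whole document
print(tag.name, tag["class"])
print(type(text).__name__, repr(text))
print(type(comment).__name__, repr(comment))
```

Because Comment is a subclass of NavigableString, code that collects "all strings" may also pick up comments. An isinstance check against Comment is one way to filter them out.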

The simplest way to navigate the document tree is by the name of the tag you want:

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

To go deeper and reach a smaller tag, for example the part wrapped in the b tag under body:

soup.body.b
# <b>The Dormouse's story</b>

However, this approach only finds the first tag of that name in document order.
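
A short sketch of that "first match only" behaviour, using a two-tag snippet made up for the example. Accessing a tag by name gives the same result as calling find(), and a name that does not occur in the document yields None rather than an error.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>two</b></p>", "html.parser")

first_b = soup.b            # only the first <b> is returned
same_b = soup.find("b")     # find() returns the same first match
missing = soup.i            # there is no <i> tag, so this is None

print(first_b)
print(same_b is first_b)
print(missing)
```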

What about getting all the tags?

For that, use the find_all() method, which returns a list of every match:

tag = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Suppose we want the second of the a tags:
need = tag[1]
# Simple, right?
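
find_all() also takes filters, which often saves indexing into the result by hand. The sketch below runs a few common ones against the three links from the Alice document. The variable names are illustrative, and 'html.parser' is used so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

names = [a.string for a in soup.find_all("a")]   # text of every <a>
hrefs = [a["href"] for a in soup.find_all("a")]  # attribute lookup on each tag
first_two = soup.find_all("a", limit=2)          # stop after two matches
lacie = soup.find("a", id="link2")               # find() gives one tag or None
sisters = soup.find_all(class_="sister")         # filter by CSS class

print(names)
print(hrefs)
print(lacie)
```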

A tag's .contents attribute returns the tag's children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

Alternatively, you can loop over a tag's children with .children:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

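This approach only walks the direct children. How do we walk all the descendants? For example, the child of head is <title>The Dormouse's story</title>, and title itself has a child, the string 'The Dormouse's story'. That string is also a descendant of head.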

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
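
The three ways of reaching child nodes can be compared side by side. A minimal sketch on just the head of the Alice document, parsed with 'html.parser' for portability:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

as_list = head_tag.contents             # direct children, as a list
direct = list(head_tag.children)        # direct children, via iteration
everything = list(head_tag.descendants) # children, grandchildren, and so on

print(len(direct), len(everything))
for node in everything:
    print(type(node).__name__, repr(node))
```

Here head has one direct child (the title tag) but two descendants, because the string inside title counts as well.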

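How do you get all the text under a tag?

If the tag has only one child, and that child is a NavigableString, tag.string returns it directly.
If the tag has many children and descendants, each containing strings,

we can iterate over them to find them all: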

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'
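
The difference between .string and .strings shows up as soon as a tag has more than one child. The sketch below, on a small made-up paragraph, also uses .stripped_strings and get_text(), two related bs4 features the article does not cover, which are handy when the whitespace-only strings seen above get in the way.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b> and <i>two</i></p>", "html.parser")

single = soup.b.string                     # <b> has exactly one string child
ambiguous = soup.p.string                  # <p> has several children: None
pieces = list(soup.p.strings)              # every string, whitespace included
clean = list(soup.p.stripped_strings)      # whitespace trimmed, blanks dropped
joined = soup.p.get_text()                 # all text joined into one str

print(repr(single), repr(ambiguous))
print(pieces)
print(clean)
print(joined)
```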

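That covers the basic use of the bs4 library for now. The remaining topics (parent nodes, sibling nodes, and moving backward and forward through the document) work much like the process of finding elements from child nodes shown above.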
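
For a first taste of those remaining directions, here is a brief sketch on a tiny made-up document. It uses the bs4 attributes .parent, .next_sibling, .previous_sibling, .next_element and .previous_element, and is meant as a starting point rather than a full tour.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b_tag = soup.b
i_tag = soup.i

parent_name = b_tag.parent.name       # up: the tag that contains <b>
after_b = b_tag.next_sibling          # sideways: the next node at the same level
before_i = i_tag.previous_sibling     # sideways, in the other direction
forward = b_tag.next_element          # forward: the next node that was parsed
backward = i_tag.previous_element     # backward: the node parsed just before <i>

print(parent_name)
print(after_b, before_i)
print(repr(forward), repr(backward))
```

Siblings stay on the same level of the tree, while next_element and previous_element follow parse order, so stepping forward from b lands on the string inside it rather than on the i tag.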
