A Handy Tool for Python Web Scraping: The BeautifulSoup Library

阿新 • Published: 2017-06-21


Beautiful Soup parses anything you give it, and does the tree traversal stuff for you.

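The BeautifulSoup library parses, traverses, and maintains a "tag tree". (Traversal means following some search path through the tree and visiting every node exactly once.) https://www.crummy.com/software/BeautifulSoup

The library is commonly called bs4 and is imported with: from bs4 import BeautifulSoup. In other words, what we mainly use from the bs4 package is its BeautifulSoup class.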

Parsers for the bs4 library
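The parser is chosen by the second argument to BeautifulSoup. As a rough sketch: 'html.parser' ships with Python, while 'lxml' and 'html5lib' need their packages installed first. The snippet below sticks to html.parser so that it runs without extra dependencies; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the site used in this article.
html = "<html><body><p class='intro'>Hello, <b>soup</b></p></body></html>"

# 'html.parser' is bundled with Python; swap in 'lxml' or 'html5lib'
# if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p
print(first_p.name)         # tag name
print(first_p["class"])     # class is multi-valued, so it comes back as a list
print(first_p.get_text())   # all text inside the tag, nested tags included
```

Different parsers can build slightly different trees from malformed HTML, so it is worth naming the parser explicitly rather than relying on the default.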


Basic elements of the BeautifulSoup class


import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.pmcaff.com/site/selection')
soup = BeautifulSoup(res.text, 'lxml')

# Any tag present in the HTML can be reached as soup.<tag>. If the document
# contains several tags with that name, soup.<tag> returns the first one.
print(soup.a)

# Every tag has a name, available as <tag>.name (a str).
print(soup.a.name)

# A tag can have any number of attributes; <tag>.attrs is a dict.
print(soup.a.attrs)
print(soup.a.attrs['class'])

# <tag>.string returns the non-attribute string inside the tag.
print(soup.a.string)

# A comment is a special string type and is also returned by <tag>.string.
# The comment text below means "this is a comment".
soup1 = BeautifulSoup('<p><!--這裏是註釋--></p>', 'lxml')
print(soup1.p.string)
print(type(soup1.p.string))

Output:

{'href': '', 'class': ['no-login']}
['no-login']
登錄
這裏是註釋

(登錄 is the link text on the page, "Log in"; 這裏是註釋 is the comment text, "this is a comment".)
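Because the request above depends on a live page, here is a self-contained sketch of the same elements (tag, name, attributes, string, and comment) on an inline snippet. The markup, the href value, and the variable names are assumptions for illustration, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup loosely modelled on the link printed above.
html = (
    '<div><a href="/login" class="no-login">Log in</a>'
    "<p><!--This is a comment--></p></div>"
)
soup = BeautifulSoup(html, "html.parser")

link = soup.a                     # first <a> in the document
print(type(link) is Tag)          # a tag is represented by a Tag object
print(link.name)                  # tag name as a str
print(link.attrs)                 # attributes as a dict
print(type(link.string))          # the text is a NavigableString

comment = soup.p.string           # the only child of <p> is a comment
print(isinstance(comment, Comment))
print(comment)                    # comment text without the <!-- --> markers
```

Comment is a subclass of NavigableString, so .string alone does not reveal whether the text came from a comment. An isinstance check like the one above is one way to tell them apart.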

Traversing HTML content with bs4

Basic structure of HTML


Downward traversal of the tag tree


Here, the BeautifulSoup object is the root node of the tag tree.

# Iterate over the direct children
for child in soup.body.children:
    print(child.name)

# Iterate over all descendants
for child in soup.body.descendants:
    print(child.name)
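To make the difference between the two loops concrete, the sketch below collects node names from a small inline document. The markup is made up for illustration, and html.parser stands in for lxml. Text nodes are included in both traversals and report a name of None.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, so no whitespace-only text nodes appear.
html = "<html><body><div><p>one</p><p>two</p></div><a href='#'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .children iterates over direct children only; .contents is the same nodes as a list.
child_names = [child.name for child in soup.body.children]
print(child_names)

# .descendants walks the whole subtree depth-first, text nodes included.
descendant_names = [node.name for node in soup.body.descendants]
print(descendant_names)
```

On real pages the markup usually contains newlines and indentation between tags, so expect extra text nodes (with name None) in both results.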

Upward traversal of the tag tree


# .parents yields every ancestor, ending with the BeautifulSoup object itself
# (its name is '[document]'); the None check is only a defensive guard.
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Output:

div
body
html
[document]
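The same walk can be reproduced offline. In the sketch below (made-up markup, html.parser instead of lxml), .parent gives the immediate parent, .parents yields every ancestor, and the BeautifulSoup object at the top of the tree has no parent of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the div > a nesting seen in the output above.
html = "<html><body><div><a href='#'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)     # immediate parent of the first <a>

# Collect the names of all ancestors, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The root of the tree is the BeautifulSoup object; nothing sits above it.
print(soup.parent)
```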

Sideways (sibling) traversal of the tag tree


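# Iterate over the siblings that follow the tag.
# Note the plural: .next_sibling returns a single node, .next_siblings iterates over all of them.
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the siblings that precede the tag
for sibling in soup.a.previous_siblings:
    print(sibling)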
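A short sketch of the singular and plural sibling attributes, on made-up markup parsed with html.parser. .next_sibling and .previous_sibling return one node (or None), while .next_siblings and .previous_siblings are generators. Siblings are any nodes that share the same parent, so text nodes count too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has four children, one of which is plain text.
html = "<p><a id='first'>A</a> and <b>B</b><i>C</i></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))     # the text node right after <a>
print(first.previous_sibling)       # nothing precedes <a> inside <p>

# Walk forward through every later sibling of <a>.
later = [str(node) for node in first.next_siblings]
print(later)

# Walk backward from <i>, nearest sibling first.
earlier = [str(node) for node in soup.i.previous_siblings]
print(earlier)
```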

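The prettify() method of bs4

The prettify() method lays the markup out more cleanly, placing each tag and each string on its own indented line; call it as soup.prettify(). In PyCharm, use print(soup.prettify()) to display the result.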
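A minimal sketch of prettify() on a made-up one-line document, parsed with html.parser to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup with no line breaks.
html = "<html><body><p>Hello<b>soup</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()   # returns a new str; soup itself is unchanged
print(pretty)

# prettify() also works on a single tag rather than the whole document.
print(soup.p.prettify())
```

Because prettify() inserts newlines and indentation, its output is meant for reading and debugging rather than for reformatting a page that will be served.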

Environment: Mac, Python 3.6, PyCharm 2016.2

Reference: the China University MOOC course 《Python網絡爬蟲與信息提取》 (Python Web Crawling and Information Extraction)

----- End -----


Author: 杜王丹, internet product manager
