
An Introduction to the Python Web-Scraping Library BeautifulSoup, with Simple Usage Examples

1. Introduction

BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports multiple parsers, so you can extract information from a page without writing regular expressions.

Commonly used parsers in Python

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires the lxml C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the lxml C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; relies on an external Python dependency |
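
As a minimal sketch of parser selection (assuming lxml may or may not be installed), the parser name is simply the second argument of the BeautifulSoup constructor, and you can fall back to the built-in html.parser when lxml is unavailable:

from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p class='demo'>Hello, <b>parser</b></p>"

try:
  # prefer lxml for speed and fault tolerance
  soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
  # lxml is not installed; use the standard-library parser instead
  soup = BeautifulSoup(markup, "html.parser")

print(soup.p.get_text())  # Hello, parser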

2. Quick Start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">The Dormouse's story</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the full, prettified document:

print(soup.prettify())
<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
 <body>
 <p class="title">
  
  The Dormouse's story
  
 </p>
 <p class="story">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">
  Elsie
  </a>,<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">
  Lacie
  </a>
  and
  <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">
  Tillie
  </a>
  ;
and they lived at the bottom of a well.
 </p>
 <p class="story">
  ...
 </p>
 </body>
</html>

Navigate the parsed data:

print(soup.title)             # the <title> tag and its contents
print(soup.title.name)        # the tag name of <title>
print(soup.title.string)      # the string inside <title>
print(soup.title.parent.name) # the name of <title>'s parent tag (head)
print(soup.p)                 # the first <p> tag
print(soup.p['class'])        # the class of the first <p>
print(soup.a)                 # the first <a> tag
print(soup.find_all('a'))     # all <a> tags
print(soup.find(id="link3"))  # the first tag with id="link3"
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title">The Dormouse's story</p>
['title']
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>

Extract the links from all <a> tags:

for link in soup.find_all('a'):
  print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text content:

print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,Lacie and
Tillie;
and they lived at the bottom of a well.
...

Automatically completing missing tags and formatting the output

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())    # format the code and complete missing tags
print(soup.title.string)  # the content of the <title> tag

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,'lxml')#傳入解析器:lxml
print(soup.title)#選擇了title標籤
print(type(soup.title))#檢視型別
print(soup.head)

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the 'name' attribute on the first <p>
print(soup.p['name'])        # an equivalent, more direct form

Getting tag content

print(soup.p.string)

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

Child nodes and descendant nodes

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> 
      and
      <a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the child nodes of the tag, returned as a list

Another approach, the .children attribute:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the tag's child nodes
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
  print(i, child)

The output is the same as above, just with an index in front of each node. Note that .children returns only an iterator object, so you have to loop over it (or convert it with list()) to see the child nodes.
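
For comparison, a small sketch reusing the soup above: .contents gives the same child nodes as a ready-made list, while .children is just an iterator that you can materialise with list():

print(type(soup.p.contents))  # <class 'list'> -- a real list of child nodes
print(soup.p.children)        # an iterator over the same child nodes
print(list(soup.p.children))  # materialise the iterator when a list is needed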

Getting descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # an iterator over the tag's descendant nodes
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
  print(i, child)

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # the parent node of the first <a> tag

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestor nodes of the first <a> tag

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))      # siblings that come after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the tag

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

name

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))           # find all <ul> tags
print(type(soup.find_all('ul')[0]))  # check the type of one result

The next example finds the <li> tags inside each <ul>:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))

attrs (attributes)

Find elements by their attributes:

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes you want to match
print(soup.find_all(attrs={'name': 'elements'}))

Both queries return the same content, because the two attributes are on the same tag.

Searching with special keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))       # id is a special attribute and can be passed directly
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_

text

Select by text content:

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))  # match the text 'Foo'; note that strings, not tags, are returned

So the text argument is convenient for matching content, but not so convenient when you actually need the elements themselves; a small workaround sketch follows.
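
If you need the enclosing tags rather than the bare strings, a small sketch reusing the snippet above: either combine the text filter with a tag name, or step up from each matched string with .parent:

# <li> tags whose text is exactly 'Foo', rather than just the strings
print(soup.find_all('li', text='Foo'))

# or recover the enclosing tag from each matched string via .parent
for s in soup.find_all(text='Foo'):
  print(s.parent)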

Methods

find

find() is used exactly like find_all(), but it returns only the first element that matches (or None if nothing matches).
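
A quick sketch of the difference, reusing the same snippet: find() gives a single Tag (or None), while find_all() always gives a list:

print(soup.find('ul'))           # the first <ul> tag
print(soup.find('table'))        # None -- no exception is raised when nothing matches
print(len(soup.find_all('ul')))  # 2 -- find_all() returns a (possibly empty) list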

find_parents(), find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the immediate parent node.

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.

find_all_next(), find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous(), find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.
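
A brief sketch of a few of these helpers, reusing the panel snippet above (the exact output depends on how whitespace nodes end up in the parsed tree):

li = soup.find('li')  # the first <li class="element">Foo</li>

print(li.find_parent('ul')['id'])   # list-1 -- the nearest enclosing <ul>
print(len(li.find_parents('div')))  # 2 -- the enclosing panel-body and panel <div>s
print(li.find_next_sibling('li'))   # <li class="element">Bar</li>
print(li.find_next('h4'))           # None -- the <h4> comes before this node
print(li.find_previous('h4'))       # <h4>Hello</h4>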

CSS selectors

Pass a CSS selector directly to select() to make the selection.

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))  # '.' means class; a space selects descendants
print(soup.select('ul li'))                  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))       # '#' means id: class "element" elements inside the tag with id "list-2"
print(type(soup.select('ul')[0]))            # print the node type

Nested selection also works, level by level:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
	print(ul.select('li'))

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
  print(ul['id'])        # [ ] gets an attribute value
  print(ul.attrs['id'])  # an equivalent form via attrs

Getting text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
  print(li.get_text())

The get_text() method returns the text content.

Summary

Use the lxml parser by default; fall back to html.parser when necessary.

Tag selection (e.g. soup.p, soup.a) is fast but offers only weak filtering; use find() and find_all() to look up a single result or multiple results.

If you are comfortable with CSS selectors, select() is recommended.

Remember the common methods for getting attribute values and text content; a short closing sketch follows.
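
As a closing sketch tying these recommendations together (a hypothetical snippet, assuming lxml is installed):

from bs4 import BeautifulSoup

html = '<div id="box"><a class="lnk" href="http://example.com">Example</a></div>'
soup = BeautifulSoup(html, 'lxml')

a = soup.find('a', class_='lnk')  # find()/find_all() for one or many matches
print(a['href'])                  # attribute access with [ ]
print(a.get_text())               # text content with get_text()
print(soup.select('#box .lnk'))   # the same element via a CSS selector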
