Traversing the document tree and working with tags using BeautifulSoup, the Python scraping library: a detailed guide

阿新 • Published: 2020-02-01

The examples below use the Python scraping library BeautifulSoup to traverse a document tree and work with its tags. They cover only the basics.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

I. Child nodes

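A Tag may contain strings and other Tags; these are all children of that Tag. BeautifulSoup provides many attributes for navigating and iterating over a Tag's children.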

1. Getting a Tag by its name

print(soup.head)
print(soup.title)

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

Accessing a Tag by name returns only the first Tag with that name. To get every Tag of a given kind, use the find_all method.

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
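To make the difference concrete, here is a minimal sketch. The markup, the variable names, and the use of the built-in html.parser (chosen only so the snippet runs without lxml installed) are assumptions for illustration, not part of the example document above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_link = demo_soup.a             # attribute access: first <a> only
all_links = demo_soup.find_all("a")  # every <a>, in document order
link_ids = [a["id"] for a in all_links]
missing = demo_soup.table            # no <table> in the markup, so this is None

print(first_link["id"])
print(link_ids)
print(missing)
```

Because a missing tag gives None rather than raising an error, chained access such as demo_soup.table.tr fails with AttributeError, so it is worth checking for None first.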

2. contents: returns a Tag's children as a list

head_tag = soup.head
head_tag.contents

[<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

title_tag.contents

["The Dormouse's story"]
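The list returned by .contents can hold both Tags and strings. The following sketch checks the type of each level; the markup is a cut-down copy of the head element, and html.parser is an assumption made only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical, cut-down markup for illustration only.
demo_soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

head_children = demo_soup.head.contents  # a plain Python list
title = head_children[0]                 # the <title> Tag
title_text = title.contents[0]           # the string inside <title>

print(len(head_children))
print(isinstance(title, Tag))
print(isinstance(title_text, NavigableString))
```

Since .contents is an ordinary list, indexing, slicing and len() all work on it as usual.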

3. children: iterate over a Tag's direct children with this attribute

for child in title_tag.children:
  print(child)

The Dormouse's story

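4. descendants: both contents and children return only direct children, whereas descendants iterates recursively over all of a tag's descendants (children, grandchildren, and so on)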

for child in head_tag.children:
  print(child)

<title>The Dormouse's story</title>

for child in head_tag.descendants:
  print(child)

<title>The Dormouse's story</title>
The Dormouse's story
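The same contrast on a slightly deeper tree, as a rough sketch. The markup and names below are made up for illustration, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = demo_soup.div

# Direct children only: the two <p> tags.
direct = [child.name for child in div.children]

# Every node below <div>, depth first: tags by name, strings by their text.
everything = [node.name if isinstance(node, Tag) else str(node)
              for node in div.descendants]

print(direct)
print(everything)
```

Note that strings count as descendants too, which is why the second list is longer than the number of tags inside the div.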

5. string: if a tag has exactly one child and that child is a NavigableString, .string returns it

title_tag.string

"The Dormouse's story"

If a tag's only child is another tag, and that child tag has a .string, the parent's .string returns the same NavigableString.

head_tag.string

"The Dormouse's story"

If a tag has more than one child, it is ambiguous which child .string should refer to, so .string returns None

print(soup.html.string)

None
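The three cases side by side, as a small sketch. The markup is hypothetical and html.parser is an assumption made for portability. The last two lines use get_text(), which is one way to collect text when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup(
    "<div><p>only text</p></div><ul><li>a</li><li>b</li></ul>", "html.parser"
)

single_string = demo_soup.p.string     # <p> has one string child
single_child = demo_soup.div.string    # <div> has one child tag, which has a .string
many_children = demo_soup.ul.string    # <ul> has two children, so this is None

# get_text() concatenates all strings below a tag, whatever the child count.
joined = demo_soup.ul.get_text()
spaced = demo_soup.ul.get_text(" ", strip=True)

print(single_string, single_child, many_children, joined, spaced)
```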

6. strings and stripped_strings

If a tag contains more than one string, iterate over them with .strings

for string in soup.strings:
  print(string)

The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

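.strings yields a lot of extra whitespace and blank lines; use .stripped_strings to drop whitespace-only strings and strip leading and trailing whitespace from the rest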

for string in soup.stripped_strings:
  print(string)

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
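A common use of stripped_strings is to rebuild readable text from markup that is full of layout whitespace. The following is only a sketch under assumed input: the markup and variable names are invented, and html.parser is used to avoid requiring lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and newlines between nodes.
demo_soup = BeautifulSoup(
    "<p>\n  Elsie,\n  <a>Lacie</a>  and\n  <a>Tillie</a>\n</p>", "html.parser"
)

raw = list(demo_soup.strings)             # includes whitespace-only strings
clean = list(demo_soup.stripped_strings)  # stripped, whitespace-only ones dropped
sentence = " ".join(clean)

print(clean)
print(sentence)
```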

II. Parent nodes

1. parent: get an element's parent node

title_tag = soup.title
title_tag.parent

<head><title>The Dormouse's story</title></head>

Strings have a parent too

title_tag.string.parent

<title>The Dormouse's story</title>

2. parents: iterate over all of an element's ancestors, from its parent up to the top of the document

link = soup.a
for parent in link.parents:
  print(parent.name)

p
body
html
[document]
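The ancestor chain can also be collected into a list, for example to build a path to an element. A minimal sketch, assuming made-up markup and html.parser (which, unlike lxml, does not add html or body tags on its own, so they are written out here):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup(
    "<html><body><p><a>Elsie</a></p></body></html>", "html.parser"
)

# Names of every ancestor of the first <a>, innermost first.
ancestors = [parent.name for parent in demo_soup.a.parents]

# Reverse the chain and drop the document object to get a readable path.
path = "/".join(reversed(ancestors[:-1])) + "/a"

print(ancestors)
print(path)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special value "[document]".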

III. Sibling nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>",'lxml')
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

1. next_sibling and previous_sibling

sibling_soup.b.next_sibling

<c>text2</c>

sibling_soup.c.previous_sibling

<b>text1</b>

In real documents, .next_sibling and .previous_sibling are usually strings or whitespace

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.a.next_sibling # the next_sibling of the first <a> tag is the string ',\n'

',\n'

soup.a.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
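Calling .next_sibling twice works here, but it depends on exactly one string sitting between the two tags. One alternative is find_next_sibling, which skips over intervening strings. A short sketch, with hypothetical markup and html.parser assumed for portability:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>', "html.parser"
)
first_link = demo_soup.a

between = first_link.next_sibling               # the string between the two tags
next_link = first_link.find_next_sibling("a")   # the next <a> sibling, strings skipped

print(repr(between))
print(next_link["id"])
```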

2. next_siblings and previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
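When only the sibling tags matter, the strings in between can be filtered out with an isinstance check. This is a sketch under assumed input; the markup and names are invented, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup(
    '<p>names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a>.</p>',
    "html.parser",
)

# Tag siblings after the first link, in document order.
later = [s["id"] for s in demo_soup.a.next_siblings if isinstance(s, Tag)]

# Tag siblings before the last link, nearest first.
earlier = [s["id"] for s in demo_soup.find(id="link3").previous_siblings
           if isinstance(s, Tag)]

print(later)
print(earlier)
```

Note that previous_siblings walks backwards, so the nearest sibling comes first.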

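IV. Going back and forth

1. next_element and previous_element

These point to the object (string or tag) that was parsed immediately after or immediately before the current one, i.e. its successor and predecessor in a depth-first (document-order) traversal. This is not necessarily the same as the next or previous sibling.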

last_a_tag = soup.find("a",id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)

;
and they lived at the bottom of a well.
Tillie

last_a_tag.previous_element

' and\n'

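2. next_elements and previous_elements

With .next_elements and .previous_elements you can move forward or backward through the document's content in the order it was parsed, as if the document were being parsed again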

for element in last_a_tag.next_elements:
  print(repr(element))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
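To see how element order differs from sibling order, the following sketch compares the two on a tiny document. The markup and names are assumptions for illustration, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
demo_soup = BeautifulSoup("<p><a>Tillie</a>; the end</p><p>...</p>", "html.parser")
link = demo_soup.a

sibling = link.next_sibling     # next node at the same level: the text after </a>
element = link.next_element     # next node parsed: the string inside <a>
before = link.previous_element  # node parsed just before <a>: the enclosing <p>

# Everything parsed after <a>, in document order, rendered as text.
following = [str(e) for e in link.next_elements]

print(repr(sibling), repr(element), before.name)
print(following)
```

The string inside a tag is parsed right after the tag opens, so it comes before anything that follows the closing tag.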
