
Hands-on: single-threaded crawling, single-thread + coroutine crawling, and multi-threaded crawling

1. Target page: https://lusongsong.com/default_2.html. Crawl the detail content behind each article link on that page (17 links in total) and save it to a local file.

2. Crawl it three ways: single-threaded, multi-threaded, and single-thread + coroutines.

  2.1 Single-threaded crawling

import requests
from lxml import etree
import time

def get_request(url):
    # fetch a detail page and return its HTML text
    response = requests.get(url).text
    return response

def parse(html):
    # extract the article title and body paragraphs, then append them to a file
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('1.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')

if __name__ == '__main__':
    start = time.time()
    # grab the index page and pull out the 17 article links
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    # fetch and parse each link one after another
    for url in urls:
        c = get_request(url)
        parse(c)
    print('Total time:', time.time() - start)

Total time: 13.49609375
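In practice the bare requests.get call above can hang indefinitely on a slow server or be rejected for lacking a User-Agent. A minimal hardened variant of get_request (not part of the original post; the header value is an illustrative placeholder):

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string; adjust as needed

def get_request(url):
    # timeout avoids hanging forever; raise_for_status surfaces HTTP errors early
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text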
  2.2 Multi-threaded crawling: first use the requests module to collect the page's article links into a list, then fetch the detail pages concurrently with a thread pool
import requests
from lxml import etree
import time
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

def get_request(url):
    # fetch a detail page and return its HTML text
    response = requests.get(url).text
    return response

def parse(html):
    # extract title and body, append to file (same as the single-threaded version)
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('1.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')


if __name__ == '__main__':
    start = time.time()
    p = Pool(3)  # three worker threads
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    # map the download over the thread pool; results come back in input order
    res_list = p.map(get_request, urls)
    for res in res_list:
        parse(res)

    print('Total time:', time.time() - start)

Total time: 1.737304925918579
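The same fan-out can also be written with the standard library's concurrent.futures, which is the more common thread-pool idiom today; a minimal sketch, assuming the get_request, parse, and urls names defined above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=3) as pool:
    # like Pool.map, executor.map returns results in input order
    for html in pool.map(get_request, urls):
        parse(html)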

  2.3 Single-thread + coroutine crawling
import time
import asyncio
import aiohttp
from lxml import etree
import requests

async def get_request(url):
    # aiohttp issues the request without blocking the event loop
    async with aiohttp.ClientSession() as sess:
        async with sess.get(url=url) as response:
            page_text = await response.text()
            return page_text

def parse(task):
    # done-callback: pull the fetched HTML out of the finished task
    page_text = task.result()
    tree = etree.HTML(page_text)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('3.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')


if __name__ == '__main__':
    start = time.time()
    # the index page itself is still fetched synchronously with requests
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    tasks = []
    for url in urls:
        c = get_request(url)              # creates a coroutine object
        task = asyncio.ensure_future(c)   # wrap it in a Task
        task.add_done_callback(parse)     # parse the page once the fetch finishes
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print('Total time:', time.time() - start)
Total time: 0.5029296875
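Since Python 3.7 the explicit event-loop management above is usually replaced by asyncio.run, and sharing one ClientSession across all requests avoids opening a new connection pool per URL. A rough equivalent sketch, not the original author's code:

import asyncio
import aiohttp

async def fetch(sess, url):
    # reuse the shared session's connection pool
    async with sess.get(url) as response:
        return await response.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as sess:
        # gather runs all fetches concurrently and returns the pages in order
        return await asyncio.gather(*(fetch(sess, u) for u in urls))

# pages = asyncio.run(crawl(urls))
# each page can then be parsed with the same XPath logic as parse() above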

Conclusion: for web crawlers, both multi-threading and coroutines improve efficiency significantly. The reason: in a single thread, every I/O operation blocks the program while it waits on the network, wasting time; with multiple threads, while thread A is waiting execution switches to thread B, so CPU time is not wasted idling and the program runs faster overall.
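The effect is easy to reproduce without a network: substitute a sleep for the request, standing in for the I/O wait. A toy sketch using the same thread pool as section 2.2:

import time
from multiprocessing.dummy import Pool

def fake_io(_):
    time.sleep(1)  # stands in for waiting on a network response

start = time.time()
for i in range(3):
    fake_io(i)
print('serial:', time.time() - start)    # ~3 s: the waits happen one after another

start = time.time()
Pool(3).map(fake_io, range(3))
print('threaded:', time.time() - start)  # ~1 s: the three waits overlap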