A First Look at Web Crawlers - Web Crawler Overview
阿新 • Published: 2018-12-22
3.1 Web Crawler Overview
3.1.1 Web Crawlers and Their Applications
Categories: general-purpose, focused, incremental, and deep web crawlers
General-purpose crawlers: what search engines use
Focused crawlers: targeted scraping of resources from relevant pages
Incremental crawlers: target only pages that have been updated
Deep web crawlers: reach web pages hidden behind surface-level links
Real-world crawler applications: BT (torrent) sites; cloud-drive search
3.1.2 Web Crawler Structure
3.2 Implementing HTTP Requests in Python
Three approaches: urllib2/urllib, httplib/urllib, and Requests
3.2.1 urllib2/urllib Implementation
1. Make a request to a given URL:
import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print(html)
The same call split into a request and a response:
import urllib2
# request
request = urllib2.Request('http://www.zhihu.com')
# response
response = urllib2.urlopen(request)
html = response.read()
print(html)
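For readers on Python 3, where urllib2 was folded into urllib.request, a rough offline sketch of the same request/response split (the Request object can be inspected before anything is sent):

```python
import urllib.request

# build the Request first (the Python 3 counterpart of urllib2.Request)
request = urllib.request.Request('http://www.zhihu.com')

# nothing has been sent yet; the object just carries the URL and method
print(request.full_url)      # http://www.zhihu.com
print(request.get_method())  # GET

# sending it would then be:
# response = urllib.request.urlopen(request)
# html = response.read()
```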
POST request, adding request data:
import urllib
import urllib2
url = 'http://www.zhihu.com'
postdata = {'username': 'qiye', 'password': 'qiye-pass'}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
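The urllib.urlencode call serializes a dict into the form-encoded string used as the POST body; in Python 3 the same function lives in urllib.parse. A minimal offline sketch:

```python
from urllib.parse import urlencode

# the same username/password dict as in the POST example
postdata = {'username': 'qiye', 'password': 'qiye-pass'}
data = urlencode(postdata)
print(data)  # username=qiye&password=qiye-pass
```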
2. Handling request headers
import urllib
import urllib2
url = 'http://www.zhihu.com'
user_agent = '...'
referer = '...'
postdata = {...}
# write the user-agent and referer into the header info
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
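The same header handling works in Python 3's urllib.request; one gotcha worth noting is that Request stores header names capitalized ('User-agent'), so get_header must use that form. A sketch with a made-up User-Agent value:

```python
import urllib.request

user_agent = 'Mozilla/5.0'  # placeholder, not a real browser string
headers = {'User-Agent': user_agent, 'Referer': 'http://www.zhihu.com/'}
req = urllib.request.Request('http://www.zhihu.com', headers=headers)

# Request normalizes header names with str.capitalize()
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.has_header('Referer'))     # True
```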
3. Handling cookies
Get the value of a particular cookie item:
import urllib2
import cookielib
cookie = cookielib.CookieJar()
# set up the opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('...')
for item in cookie:
    print(item.name + ':' + item.value)
# add cookie content manually
import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie','email='+"..."))
req = urllib2.Request("...")
response = opener.open(req)
print(response.headers)
retdata = response.read()
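When all you have is a raw cookie string rather than a live response, the stdlib http.cookies module (named Cookie in Python 2) can parse out individual items such as the email value above. A sketch with made-up cookie content:

```python
from http.cookies import SimpleCookie

# parse a raw Set-Cookie style string into named morsels
cookie = SimpleCookie()
cookie.load('email=qiye; Path=/')

print(cookie['email'].value)    # qiye
print(cookie['email']['path'])  # /
```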
4. Setting a timeout
import urllib2
request = urllib2.Request('...')
response = urllib2.urlopen(request,timeout=2)
html = response.read()
print(html)
5. Getting the HTTP response code
import urllib2
try:
    response = urllib2.urlopen('...')
    print(response)
except urllib2.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code:', e.code)
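The hasattr(e, 'code') pattern can be exercised without the network by raising an HTTPError directly (shown with Python 3's urllib.error; the 404 here is simulated, not fetched):

```python
import urllib.error

try:
    # simulate a server replying 404 instead of calling urlopen
    raise urllib.error.HTTPError('http://www.zhihu.com', 404, 'Not Found', {}, None)
except urllib.error.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code:', e.code)  # Error code: 404
        code = e.code
```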
6. Redirects
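By default urlopen follows 30x responses automatically through HTTPRedirectHandler, and geturl() then reports the final URL. A self-contained Python 3 sketch against a throwaway local server (the paths and response body are made up):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)               # redirect /old ...
            self.send_header('Location', '/new')  # ... to /new
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'landed')

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/old' % server.server_address[1]
response = urllib.request.urlopen(url)  # the 302 is followed silently
body = response.read()
print(response.geturl())  # final URL ends with /new
print(body)               # b'landed'
server.shutdown()
```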
7. Setting a proxy
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('...')
print(response.read())
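Note that build_opener takes handler instances as separate arguments (not a list), and the function is spelled install_opener. The same chain in Python 3, checked offline (127.0.0.1:8087 is the example address from the notes, not a live proxy, so no request is sent):

```python
import urllib.request

proxy = urllib.request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib.request.build_opener(proxy)  # handlers passed directly
urllib.request.install_opener(opener)        # make it the global default

# the handler keeps the mapping it was given
print(proxy.proxies)  # {'http': '127.0.0.1:8087'}
```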
3.2.2 httplib/urllib Implementation
The httplib module is a low-level module that lets you see every step of an HTTP request. It is rarely needed in crawler development, so this is just background knowledge:
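To make the "every step" point concrete: with httplib (http.client in Python 3), connecting, sending the request, reading the status line, and reading the body are all separate calls. A sketch against a throwaway local server:

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'hello')

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# each step of the HTTP exchange is an explicit call
conn = http.client.HTTPConnection('127.0.0.1', server.server_address[1])
conn.request('GET', '/')         # send the request line and headers
resp = conn.getresponse()        # read the status line and headers
print(resp.status, resp.reason)  # 200 OK
body = resp.read()               # read the body as a separate step
print(body)                      # b'hello'
conn.close()
server.shutdown()
```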
3.2.3 The More Human-Friendly Requests
1. The complete request-response model
GET:
import requests
r = requests.get('...')
print(r.content)
POST:
import requests
postdata = {...}
r = requests.post('...',data=postdata)
print(r.content)
2. Response and encoding
import requests
r = requests.get('...')
print('content-->>'+r.content)
print('text-->>'+r.text)
print('encoding-->>'+r.encoding)
r.encoding = 'utf-8'
print('new text -->>'+r.text)
chardet is a module for detecting string/file encodings.
Assign the encoding chardet detects directly to r.encoding, and r.text will then print without garbled characters:
import requests
import chardet
r = requests.get('...')
print(chardet.detect(r.content))
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)
3. Handling request headers
import requests
user_agent = '...'
headers = {'User-Agent':user_agent}
r = requests.get('...',headers = headers)
print(r.content)
4. Handling the response code and response headers
Get the response code: the status_code field
Get the response headers: the headers field
import requests
r = requests.get('...')
if r.status_code == requests.codes.OK:
    print(r.status_code)  # get the response code
    print(r.headers)  # get the response headers
    print(r.headers.get('content-type'))  # get a specific header field
else:
    r.raise_for_status()  # raise the exception explicitly
5. Handling cookies
Get the cookie fields:
import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
r = requests.get('...', headers=headers)
for cookie in r.cookies.keys():
    print(cookie + ':' + r.cookies.get(cookie))
Add a custom cookie:
import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye',age='10')
r = requests.get('...',headers = headers,cookies = cookies)
print(r.text)
Requests provides the Session concept, which adds cookies to your program automatically:
import requests
loginurl = '...'
s = requests.Session()
# first visit the login page as a guest; the server assigns a cookie
r = s.get(loginurl, allow_redirects=True)
datas = {'name': 'qiye', 'passwd': 'qiye'}
# POST to the login URL; once verified, guest privileges become member privileges
r = s.post(loginurl, data=datas, allow_redirects=True)
print(r.text)
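What Session does under the hood is keep a cookie jar between requests. Since no real login endpoint is available here, a Python 3 stdlib sketch against a throwaway local server shows the same mechanism (the sid cookie value is made up):

```python
import threading
import http.cookiejar
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # hand the "guest" a session cookie on first contact
        self.send_response(200)
        self.send_header('Set-Cookie', 'sid=guest123')
        self.end_headers()
        self.wfile.write(b'welcome guest')

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

# a CookieJar-backed opener plays the role of requests.Session
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

opener.open(base + '/login').read()
# the Set-Cookie from that response is stored in the jar and will be
# attached automatically to every later request through this opener
names = {c.name: c.value for c in jar}
print(names)  # {'sid': 'guest123'}
server.shutdown()
```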
6. Redirects and history
Handle redirects: allow_redirects
View the history: r.history
import requests
r = requests.get('...')
print(r.url)
print(r.status_code)
print(r.history)
7. Timeout setting
requests.get('...',timeout=2)
8. Proxy setting
import requests
proxies = {"http": "....", "https": "......"}
requests.get("...",proxies = proxies)