Crawler Basics: The requests Library

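Introduction

requests is a third-party HTTP library for Python (install it with pip install requests; it is not part of the standard library). It sends HTTP requests and handles the responses, and its code is far more concise than the equivalent urllib code.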

Request types

  • requests.post
  • requests.get
  • requests.delete
  • requests.head
  • requests.options

requests.post and requests.get are called in the same way, so the examples here use GET.

GET request methods

Method 1 (query parameters appended to the URL):

#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.python.org?name=user&age=18')
print(r.text)

Method 2 (query parameters passed as a dictionary via params):

#-*- coding: utf-8 -*-
import requests
data = {'name':'user','age':'18'}
r = requests.get('http://www.python.org',params=data)
print(r.text)
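
To see what the params dictionary turns into, the request can be prepared without being sent. The following is a minimal sketch that runs offline; it only assumes the requests.Request class and its prepare() method, and the variable names are illustrative.

```python
import requests

# Build the request but do not send it, so no network access is needed.
params = {'name': 'user', 'age': '18'}
prepared = requests.Request('GET', 'http://www.python.org', params=params).prepare()

# requests URL-encodes the dictionary and appends it as the query string.
print(prepared.url)
```

The printed URL should carry the same query string that Method 1 writes by hand, which is why the two methods are interchangeable.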

GET request examples

A typical GET request

import requests
r = requests.get('http://www.python.org')
print("Type:",type(r))
print("Status:",r.status_code)
print("Response headers:",r.headers)
print("Response body type:",type(r.text))
print("Response body:",r.text)
print("Cookies:",r.cookies)

Output:
Type: <class 'requests.models.Response'>
Status: 200
Response headers: {'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'X-Frame-Options': 'DENY', 'Via': '1.1 vegur, 1.1 varnish, 1.1 varnish', 'Content-Length': '49230', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 17 May 2019 03:50:39 GMT', 'Age': '1597', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2142-IAD, cache-tyo19921-TYO', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 2272', 'X-Timer': 'S1558065040.883277,VS0,VE0', 'Vary': 'Cookie', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains'}
Response body type: <class 'str'>
Response body: <html>...omitted...</html>
Cookies: <RequestsCookieJar[]>

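As the output shows, r.text is a string. When the response body is JSON, r.json() parses it into Python objects (dicts and lists). If the body is not valid JSON, the call raises a JSONDecodeError such as json.decoder.JSONDecodeError, so catch that exception to avoid crashing on a parse error.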
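
One way to guard against the parse error is a small wrapper around r.json(). This is only a sketch: the helper name parse_json_or_none is hypothetical, and the commented-out usage assumes an endpoint such as httpbin.org that returns JSON. The decode errors raised by r.json() are subclasses of ValueError, so catching ValueError covers them.

```python
import requests

def parse_json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # JSON decode errors are ValueError subclasses, so this catches them.
        return None

# Usage against a live endpoint (commented out because it needs network access):
# r = requests.get('http://httpbin.org/get')
# print(parse_json_or_none(r))
```

An HTML page such as the python.org response above would return None here instead of raising.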

GET request combined with the re module to filter content

#-*- coding: utf-8 -*-
import requests,re

# Request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://www.douban.com/',headers=headers)         # GET request
a_tag = re.findall("<a.*</a>",r.text)                               # Filter the response with a regular expression
print("All a tags on the page:",a_tag)

Output:
All a tags on the page: ['<a target="_blank" class="lnk-book" href="https://book.douban.com">豆瓣讀書</a>', '<a target="_blank" class="lnk-movie" href="https://movie.douban.com">豆瓣電影</a>', '<a target="_blank" class="lnk-music" href="https://music.douban.com">豆瓣音樂</a>',...omitted...]
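
The pattern "<a.*</a>" is greedy, so when several links share one line it returns them as a single match. The sketch below uses a made-up HTML string (not taken from the page above) to compare it with a non-greedy variant; it uses only the standard re module.

```python
import re

# Hypothetical one-line HTML fragment with two links.
html = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'

greedy = re.findall(r"<a.*</a>", html)     # .* runs to the last </a> on the line
lazy = re.findall(r"<a\b.*?</a>", html)    # .*? stops at the first </a> it meets

print(len(greedy), greedy)
print(len(lazy), lazy)
```

For anything beyond a quick filter, an HTML parser is usually more robust than a regular expression.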

GET request for binary data (an image in this example)

#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.soso.com/soso/images/favicon_new.ico')
print("Read as text:",r.text)
print("Read as bytes:",r.content)

Output:
Read as text:          h     (                                ��� ��� ��� ��� ������]���‰���±�������
Read as bytes: b'\x00\x00\x01\x00\x01\x00\x10\x10\x00\x00\x01\x00 \x00h\x04\x00\x00\x16\x00\x00\x00(\x00\x00\x00\x10\x00\x00\x00 \x00\x00\x00\x01'
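
Because r.content is raw bytes, it can be written straight to a file opened in binary mode. The following is a minimal offline sketch: save_binary and the temp-file name are hypothetical, and the sample bytes are just the first eight bytes of the output shown above, standing in for a real r.content.

```python
import os
import tempfile
from pathlib import Path

def save_binary(content: bytes, path: str) -> int:
    """Write raw bytes to path and return the number of bytes written."""
    return Path(path).write_bytes(content)

# Stand-in for r.content so the example runs without network access.
sample = b'\x00\x00\x01\x00\x01\x00\x10\x10'
target = os.path.join(tempfile.gettempdir(), 'favicon_sample.ico')
written = save_binary(sample, target)
print(written, "bytes written to", target)
```

With a real response the call would be save_binary(r.content, 'favicon.ico'); r.text should not be used for this because decoding corrupts binary data.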

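POST request examples

A plain POST request works the same way as GET, so it is omitted here: just replace get with post. The examples below cover most cases where data is submitted with POST.

Submitting data with a POST request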

import requests
data = {'name':'user','age':'18'}
r = requests.post('http://www.aaa.com',data=data)

Uploading a file with a POST request

import requests
with open('favicon.ico','rb') as f:                 # the file is closed once the upload finishes
    files = {'file':f}
    r = requests.post('http://www.aaa.com',files=files)

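Getting cookies

Cookies are available on the response's cookies object. The same values can also be seen in the browser's developer tools under Network > Headers.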

#-*- coding: utf-8 -*-
import requests

r = requests.get("https://www.zhihu.com/")
print("Cookies object:",r.cookies)

# Iterate over the cookies and print each one
for key,value in r.cookies.items():
    print("Cookie:",key + '=' + value)

Output:
Cookies object: <RequestsCookieJar[<Cookie _xsrf=ccDJCJBWZFXbSdOIqS5fVks34CVhFwQ8 for .zhihu.com/>, <Cookie tgw_l7_route=a37704a413efa26cf3f23813004f1a3b for www.zhihu.com/>]>
Cookie: _xsrf=ccDJCJBWZFXbSdOIqS5fVks34CVhFwQ8
Cookie: tgw_l7_route=a37704a413efa26cf3f23813004f1a3b
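
The cookie jar can also be converted to a plain dictionary or queried by name. This sketch builds a jar by hand so that it runs offline; the cookie names, values and domains are made up for illustration.

```python
import requests
from requests.cookies import RequestsCookieJar

# Hand-built jar standing in for r.cookies (all values are placeholders).
jar = RequestsCookieJar()
jar.set('_xsrf', 'example-token', domain='.example.com', path='/')
jar.set('route', 'abc123', domain='www.example.com', path='/')

as_dict = requests.utils.dict_from_cookiejar(jar)   # name -> value mapping
print(as_dict)
print(jar.get('_xsrf'))                             # look up one cookie by name
```

A dictionary is convenient for logging or saving cookies, though it drops the domain and path information that the jar keeps.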

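Sending cookies with a request. This is only a small example of the syntax: many sites now use anti-crawler measures and also require an Authorization header, which will be covered later.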

#-*- coding: utf-8 -*-
import requests
headers={
    'Cookie':'tgw_l7_route=116a747939468d99065d12a386ab1c5f; _xsrf=PZRQSNJ7BbRYSF3C43RmJqeJg5IDjQm6; tst=r; q_c1=3cd18195e35d4c0f8ec94ec70bbec158|1558078820000|1558078820000',
    'Host': 'www.aaa.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
}
r = requests.get('https://www.aaa.com/',headers=headers)
print(r.text)
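
Instead of writing the header string by hand, cookies can be passed through the cookies parameter and requests builds the header itself. A minimal offline sketch, with placeholder cookie values and the same placeholder host as above:

```python
import requests

cookies = {'tst': 'r', '_xsrf': 'example-token'}   # placeholder values
req = requests.Request('GET', 'https://www.aaa.com/', cookies=cookies)
prepared = req.prepare()                           # built but not sent

# The dictionary is serialised into a single Cookie request header.
print(prepared.headers['Cookie'])
```

The same dictionary can be passed directly as requests.get(url, cookies=cookies).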

Keeping state with a session

Test without a session:

#-*- coding: utf-8 -*-
import requests

r = requests.get('http://www.httpbin.org/cookies/set/number/123456')
print("First request (sets the cookie):",r.text)

r2 = requests.get('http://www.httpbin.org/cookies')
print("Second request to the same domain:",r2.text)

Output:
First request (sets the cookie): {
  "cookies": {
    "number": "123456"
  }
}

Second request to the same domain: {
  "cookies": {}
}

With a session, cookies persist across requests, so they no longer have to be attached to every request by hand:

#-*- coding: utf-8 -*-
import requests
s = requests.Session()                          # Session that keeps state between requests
r = s.get('http://www.httpbin.org/cookies/set/number/123456')
print("First request (sets the cookie):",r.text)

r2 = s.get('http://www.httpbin.org/cookies')
print("Second request to the same domain:",r2.text)

Output:
First request (sets the cookie): {
  "cookies": {
    "number": "123456"
  }
}

Second request to the same domain: {
  "cookies": {
    "number": "123456"
  }
}
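
The effect can be inspected offline as well: the session keeps a cookie jar, and every request prepared through the session picks up what is in it. In this sketch the cookie is put into the jar manually to simulate what an earlier response would have stored.

```python
import requests

s = requests.Session()
# Simulate a cookie that an earlier response would have stored in the session.
s.cookies.set('number', '123456')

# prepare_request merges session-level state (cookies, headers) into the request.
req = requests.Request('GET', 'http://www.httpbin.org/cookies')
prepared = s.prepare_request(req)
print(prepared.headers.get('Cookie'))
```

A bare requests.get call creates a fresh, empty session each time, which is why the second request in the first test saw no cookies.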

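SSL warnings and skipping verification

Passing verify=False skips certificate verification, so the request is not rejected when the server's SSL configuration is invalid or misconfigured; urllib3.disable_warnings() silences the warning this would otherwise print. Use it only with hosts you trust: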

#-*- coding: utf-8 -*-
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
r = requests.get('https://www.python.org/',verify=False)
print(r.status_code)

Proxy settings

Basic proxy configuration:

#-*- coding: utf-8 -*-
import requests
proxies = {
    "http":"http://10.0.0.10:3128",
    "https":"https://10.0.0.10:1080",
}
requests.get("https://www.python.org",proxies=proxies)

Proxy that requires HTTP Basic Auth credentials:

#-*- coding: utf-8 -*-
import requests
proxies = {
    'http':'http://user:password@host:port',
    'https':'http://user:password@host:port',
}
requests.get("https://www.python.org",proxies=proxies)

SOCKS proxy (requires pip install 'requests[socks]'):

#-*- coding: utf-8 -*-
import requests
proxies = {
    'http':'socks5://user:password@host:port',
    'https':'socks5://user:password@host:port',
}
requests.get("https://www.python.org",proxies=proxies)

Timeout settings

#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.python.org',timeout=10)
print(r.status_code)
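
When the timeout expires, requests raises an exception instead of returning a response. The sketch below shows one way to handle it; fetch_status is a hypothetical helper, and the (connect, read) tuple values are arbitrary examples.

```python
import requests

def fetch_status(url, timeout=(3.05, 10)):
    """Return the HTTP status code, or None if the request timed out.

    timeout is a (connect timeout, read timeout) pair in seconds;
    a single number applies the same limit to both.
    """
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None

# Example call (commented out because it needs network access):
# print(fetch_status('http://www.python.org'))
```

Without a timeout argument, requests waits indefinitely, so setting one is advisable in any crawler.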

Authentication

HTTPBasicAuth authentication

#-*- coding: utf-8 -*-
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://localhost:5000',auth=HTTPBasicAuth('username','password'))
print(r.status_code)

When auth is given as a tuple, requests uses HTTPBasicAuth by default, so there is a shorter form:

#-*- coding: utf-8 -*-
import requests
r = requests.get('http://localhost:5000',auth=('username','password'))
print(r.status_code)
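
Under the hood, Basic Auth is just an Authorization header containing the base64-encoded "username:password" pair. The sketch below prepares the request without sending it and compares the header against a value computed by hand.

```python
import base64
import requests

req = requests.Request('GET', 'http://localhost:5000', auth=('username', 'password'))
prepared = req.prepare()   # built but not sent, so no server is needed

# Basic auth is "Basic " followed by base64("username:password").
expected = 'Basic ' + base64.b64encode(b'username:password').decode('ascii')
print(prepared.headers['Authorization'])
print(prepared.headers['Authorization'] == expected)
```

Since base64 is an encoding and not encryption, Basic Auth should only be sent over HTTPS.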

Other authentication schemes (OAuth as an example; install the requests_oauthlib module with pip first)

#-*- coding: utf-8 -*-
import requests
from requests_oauthlib import OAuth1
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY','YOUR_APP_SECRET','USER_OAUTH_TOKEN','USER_OAUTH_TOKEN_SECRET')
requests.get(url,auth=auth)

Prepared Request data structure

A request can be represented as a data structure: all of its parameters are held in a Prepared Request object.

#-*- coding: utf-8 -*-
from requests import Request,Session
url = 'http://httpbin.org/post'
data = {'name':'user'}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
s = Session()                                           # Create a Session object
req = Request('POST',url,data=data,headers=headers)     # Create a Request object
prepped = s.prepare_request(req)                        # Convert it to a Prepared Request with the session's prepare_request
r = s.send(prepped)                                     # Send it with s.send()
print(r.text)

Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "user"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
  }, 
  "json": null, 
  "origin": "119.98.241.138, 119.98.241.138", 
  "url": "https://httpbin.org/post"
}
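
One benefit of this approach is that the prepared request can be inspected, or adjusted, before it is sent. The following sketch stops short of calling s.send(), so it runs offline; the X-Example header is hypothetical and only there for illustration.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'user'})
prepped = s.prepare_request(req)

# The prepared request exposes the final method, URL, headers and encoded body.
print(prepped.method, prepped.url)
print(prepped.headers['Content-Type'])
print(prepped.body)

# Headers can still be changed before s.send(prepped) is called.
prepped.headers['X-Example'] = 'demo'   # hypothetical header
```

The encoded body and content type printed here correspond to the "form" and "Content-Type" fields that httpbin echoes back in the output above.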

Scraping example

Scraping Maoyan movie information

#-*- coding: utf-8 -*-
import requests,re,json

def get_one_page(url,headers):
    r = requests.get(url,headers=headers)                    # Request the page
    return r.text
def video_name(html):
    # Custom pattern; a raw string keeps \d from being read as a string escape
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">[\n]*(.*?)</p>.*?releasetime">(.*?)</p>',re.S)
    s = re.findall(pattern,html)                             # Find all matches
    return s
def write_to_file(content):
    with open('result.txt','a',encoding='utf8') as f:        # Open in append mode (the file is created if missing)
        for i in content:                                    # Iterate so each record is stored separately
            f.write(json.dumps(i,ensure_ascii=False)+'\n')   # Write one JSON record per line
def main():
    url = 'https://maoyan.com/board/4'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    html = get_one_page(url,headers)                         # Fetch the page
    con = video_name(html)                                   # Extract the fields with the regex
    print("Matched content (rank / image link / title / stars / release date):",con)
    write_to_file(con)                                       # Write to file

main()

Contents of result.txt:
["1", "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c", "霸王別姬", "                主演:張國榮,張豐毅,鞏俐\n        ", "上映時間:1993-01-01"]
["2", "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c", "肖申克的救贖", "                主演:蒂姆·羅賓斯,摩根·弗里曼,鮑勃·岡頓\n        ", "上映時間:1994-09-10(加拿大)"]
["3", "https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@160w_220h_1e_1c", "羅馬假日", "                主演:格利高裡·派克,奧黛麗·赫本,埃迪·艾伯特\n        ", "上映時間:1953-09-02(美國)"]
["4", "https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@160w_220h_1e_1c", "這個殺手不太冷", "                主演:讓·雷諾,加里·奧德曼,娜塔莉·波特曼\n        ", "上映時間:1994-09-14(法國)"]
["5", "https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c", "泰坦尼克號", "                主演:萊昂納多·迪卡普里奧,凱特·溫絲萊特,比利·贊恩\n        ", "上映時間:1998-04-03"]
["6", "https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@160w_220h_1e_1c", "唐伯虎點秋香", "                主演:周星馳,鞏俐,鄭佩佩\n        ", "上映時間:1993-07-01(中國香港)"]
["7", "https://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c", "魂斷藍橋", "                主演:費雯·麗,羅伯特·泰勒,露塞爾·沃特森\n        ", "上映時間:1940-05-17(美國)"]
["8", "https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@160w_220h_1e_1c", "亂世佳人", "                主演:費雯·麗,克拉克·蓋博,奧利維婭·德哈維蘭\n        ", "上映時間:1939-12-15(美國)"]
["9", "https://p1.meituan.net/movie/ba1ed511668402605ed369350ab779d6319397.jpg@160w_220h_1e_1c", "天空之城", "                主演:寺田農,鷲尾真知子,龜山助清\n        ", "上映時間:1992"]
["10", "https://p0.meituan.net/movie/b0d986a8bf89278afbb19f6abaef70f31206570.jpg@160w_220h_1e_1c", "辛德勒的名單", "                主演:連姆·尼森,拉爾夫·費因斯,本·金斯利\n        ", "上映時間:1993-12-15(美
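
The records above still carry padding whitespace and the "主演" / "上映時間" labels. A possible post-processing step is sketched below; clean_record and strip_label are hypothetical helpers, the field names are assumptions, and the sample tuple is the first record from the output.

```python
import re

# Matches a leading label up to the first ASCII or full-width colon.
_LABEL = re.compile(r'^[^:：]*[:：]')

def strip_label(text):
    """Trim whitespace and drop a leading 'label:' prefix if present."""
    return _LABEL.sub('', text.strip(), count=1)

def clean_record(record):
    """Turn one (rank, image, title, stars, release) tuple into a tidy dict."""
    rank, image, title, stars, release = record
    return {
        'rank': int(rank),
        'image': image,
        'title': title,
        'stars': strip_label(stars),
        'release': strip_label(release),
    }

sample = ("1",
          "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c",
          "霸王別姬",
          "                主演:張國榮,張豐毅,鞏俐\n        ",
          "上映時間:1993-01-01")
cleaned = clean_record(sample)
print(cleaned)
```

In the script above this could be applied to each tuple before write_to_file is called, so that result.txt holds one clean JSON object per line.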