Scrapy: the login/authentication part of crawling Zhihu (cookie-based login)
阿新 • Published: 2019-02-09
Scrapy is an asynchronous I/O crawling framework, so the approach here is: log in with requests first and save the cookies to a file, then load that cookie file at scrapy's entry point before crawling starts.
* Logging in and saving the cookie were covered in the previous two sections; this section shows how scrapy reads the cookie.
* First, be clear about some scrapy basics regarding login:
1. scrapy crawling has a single entry point; load the cookie there, and subsequent requests will carry the cookie information automatically.
2. scrapy accepts cookies in dict form, so unlike loading cookies with requests, the cookiejar must first be converted to a dict.
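Point 2 above can be sketched as a self-contained round trip: build an LWPCookieJar holding one hand-made cookie (the name `z_c0` and value `demo-token` are hypothetical stand-ins, not real Zhihu credentials), save and reload it the same way the crawler does, then convert it with `requests.utils.dict_from_cookiejar`:

```python
import os
import tempfile
import http.cookiejar as cookielib

import requests

# Build a cookie jar with one hand-made session cookie (hypothetical values).
jar = cookielib.LWPCookieJar()
jar.set_cookie(cookielib.Cookie(
    version=0, name='z_c0', value='demo-token', port=None, port_specified=False,
    domain='www.zhihu.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False))

# Round-trip through an LWP-format cookie file, as the crawler does.
path = os.path.join(tempfile.mkdtemp(), 'zhihu_cookie.txt')
jar.save(path, ignore_discard=True, ignore_expires=True)

loaded = cookielib.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)

# scrapy's `cookies=` argument wants a plain dict, not a CookieJar.
cookie_dict = requests.utils.dict_from_cookiejar(loaded)
print(cookie_dict)  # {'z_c0': 'demo-token'}
```

Note `ignore_discard=True` on both save and load: session cookies are marked "discard", and without this flag they would silently vanish from the file.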
Load the cookie file and convert it to dict form:
#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
# @Time : 2018/4/18 18:10
# @Author : ysj
import os
import http.cookiejar as cookielib

import requests

from scrapy_article.utl.ZhiHuLogIn import ZhiHu

cookie_file = os.path.join(os.path.dirname(__file__), 'zhihu_cookie.txt')


def load_cookie(filename=cookie_file):
    """Load the saved cookie file and convert it to a dict."""
    cookie = cookielib.LWPCookieJar()
    cookie.load(filename, ignore_discard=True, ignore_expires=True)
    return requests.utils.dict_from_cookiejar(cookie)


def get_dict_cookie(account='18516157608', password='******'):
    """Log in again via the ZhiHu class from the previous section
    and return its session cookies as a dict."""
    login = ZhiHu(account, password)
    return requests.utils.dict_from_cookiejar(login.session.cookies)


if __name__ == '__main__':
    print(load_cookie())
    print(get_dict_cookie('18516157608', '*****'))
Loading the cookie at scrapy's entry point
The idea is to override scrapy's entry request, attaching the cookie and headers.
For brevity, only the overridden entry function and a function that verifies whether login succeeded are shown; the load_cookie function above must be imported beforehand.
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
        'HOST': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/signin?next=%2F',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    login_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
    captcha_url = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=en'
    check_url = 'https://www.zhihu.com/inbox'
    # check_url = 'https://www.zhihu.com/'

    def start_requests(self):
        # Override the entry point: request a login-only page with the saved
        # cookie attached, and disable redirects so a 3xx reveals a stale login.
        # self.post_data['signature'] = self.get_signature()  # part of the unused scrapy-login path
        return [scrapy.Request(url=self.check_url, headers=self.headers, method='GET',
                               meta={'dont_redirect': True},
                               cookies=load_cookie(), callback=self.check_login)]

    def check_login(self, response):
        """We logged in with requests and saved the cookie for scrapy to use;
        logging in from within scrapy is not considered for now."""
        if response.status < 300:
            print('Login succeeded')
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
        else:
            print('Not logged in; about to fetch a captcha and log in again')
            # Yield the Request itself (not a list), so scrapy can schedule it.
            yield scrapy.Request(url=self.check_url, headers=self.headers, method='GET',
                                 meta={'dont_redirect': True},
                                 cookies=get_dict_cookie(), callback=self.check_login)