
Scrapy Zhihu crawler: login authentication (cookie-based login)

Scrapy is an asynchronous I/O framework, so here we log in with requests first and save the cookies to a file; before Scrapy starts crawling, the cookies are loaded at the spider's entry point.
* Logging in and saving the cookies was covered in the previous two sections; this section shows how Scrapy reads those cookies
* First, some basics about how login works in the Scrapy framework
1. A Scrapy crawl has a single entry point; load the cookies there and every subsequent request carries the cookie information automatically
2. Scrapy accepts cookies as a dict, so unlike requests (which works with a CookieJar), the cookies must be converted to a dict first
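Point 2 can be checked in isolation: requests keeps cookies in a CookieJar, while the `cookies` argument of `scrapy.Request` expects a plain dict. `requests.utils` ships converters for both directions (the `z_c0` token value below is made up for illustration):

```python
import requests

# requests stores cookies in a CookieJar ...
jar = requests.utils.cookiejar_from_dict({'z_c0': 'demo-token'})

# ... but scrapy.Request(cookies=...) wants a plain dict, so convert back
cookies = requests.utils.dict_from_cookiejar(jar)
print(cookies)  # {'z_c0': 'demo-token'}
```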

Load the cookie file and convert it to a dict

#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
# @Time   : 2018/4/18 18:10
# @Author : ysj
import os
import http.cookiejar as cookielib

import requests

from scrapy_article.utl.ZhiHuLogIn import ZhiHu

cookie_file = os.path.join(os.path.dirname(__file__), 'zhihu_cookie.txt')


def load_cookie(filename=cookie_file):
    """Load the cookie file and convert it to the dict form Scrapy expects"""
    cookie = cookielib.LWPCookieJar()
    cookie.load(filename, ignore_discard=True, ignore_expires=True)
    return requests.utils.dict_from_cookiejar(cookie)


def get_dict_cookie(account='18516157608', password='******'):
    """Log in via the ZhiHu class from the previous section and return its cookies as a dict"""
    login = ZhiHu(account, password)
    return requests.utils.dict_from_cookiejar(login.session.cookies)


if __name__ == '__main__':
    print(load_cookie())
    print(get_dict_cookie('18516157608', '*****'))
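For completeness, here is a minimal sketch of how a cookie file in the LWP format read by load_cookie could be produced. In the real workflow the jar is filled by the requests-based login from the previous sections; here a cookie is set by hand, and the file name and token value are made up:

```python
import http.cookiejar as cookielib

import requests
from requests.cookies import create_cookie

# Fill a jar by hand; the real login would populate session.cookies instead
jar = cookielib.LWPCookieJar('zhihu_cookie_demo.txt')
jar.set_cookie(create_cookie('z_c0', 'demo-token', domain='www.zhihu.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Reading it back gives the dict form that load_cookie() returns
reloaded = cookielib.LWPCookieJar()
reloaded.load('zhihu_cookie_demo.txt', ignore_discard=True, ignore_expires=True)
print(requests.utils.dict_from_cookiejar(reloaded))  # {'z_c0': 'demo-token'}
```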

Loading the cookies at the Scrapy entry point

The main task is to override Scrapy's entry-point request and attach the cookies and headers.
For brevity, only the overridden entry function and a function that verifies whether login succeeded are shown; the load_cookie function above must be imported beforehand.

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
        'HOST': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/signin?next=%2F',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    login_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
    captcha_url = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=en'
    check_url = 'https://www.zhihu.com/inbox'
    # check_url = 'https://www.zhihu.com/'
    def start_requests(self):
        # handle_httpstatus_list lets a 301/302 "not logged in" redirect reach
        # check_login; HttpErrorMiddleware would otherwise drop the response
        return [scrapy.Request(url=self.check_url, headers=self.headers, method='GET',
                               meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
                               cookies=load_cookie(), callback=self.check_login)]

    def check_login(self, response):
        """The cookies saved by the earlier requests-based login are handed to
        Scrapy; logging in from inside Scrapy itself is not covered here."""
        if response.status < 300:
            print('login successful')
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
        else:
            print('not logged in; will fetch a captcha and log in again')
            # yield the Request itself, not a list of Requests
            yield scrapy.Request(url=self.check_url, headers=self.headers, method='GET',
                                 meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
                                 cookies=get_dict_cookie(), callback=self.check_login)
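When the cookie hand-off misbehaves, Scrapy's built-in cookie middleware can log the Cookie and Set-Cookie headers of every request and response. A minimal settings.py sketch (COOKIES_ENABLED is already True by default; this is only a debugging aid, not part of the login flow above):

```python
# settings.py (sketch; both options belong to Scrapy's CookiesMiddleware)
COOKIES_ENABLED = True   # default: keep the session cookies across requests
COOKIES_DEBUG = True     # log outgoing Cookie and incoming Set-Cookie headers
```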