Building a Crawler App in Python to Scrape WeChat Official Account Articles: The Crawling Road Never Ends

阿新 • Published: 2021-11-08

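Today I was reading articles in an official account on my phone and accidentally backed out. Getting back to where I was meant paging through the list one screen at a time, which was a pain. So I had an idea: scrape the article list and then open whichever article I want. Heh, the simple joys of being a programmer.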

Crawler demo

My computer is laggy, so please bear with me...

Development tools

python
pycharm
selenium
tkinter
xlwt

Approach

First, start from this URL:

start_url="https://mp.weixin.qq.com/"

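Scan the QR code to register on the WeChat Official Accounts Platform (a personal subscription account is enough). If you already have an account, skip registration and just scan to log in.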

Use selenium to drive the browser through the QR-code login and collect the cookies; every later request needs them.
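
As a minimal sketch of the cookie step: Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, and these can be joined into one `Cookie` header string. The helper name `cookies_to_header` and the sample cookie names are made up for illustration.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Hypothetical sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "session_a", "value": "abc"},
    {"name": "session_b", "value": "123"},
]
print(cookies_to_header(sample))  # session_a=abc; session_b=123
```

The resulting string goes into the `cookie` entry of the request headers.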

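You need to download a webdriver first.

Download the ChromeDriver build that matches your Google Chrome version. The download contains chromedriver.exe; put it in the same directory as your Python interpreter's python.exe.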
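
A quick way to check that the driver ended up somewhere usable is sketched below, using only the standard library. The helper name `find_chromedriver` is hypothetical; it looks next to the running interpreter first (the location suggested above) and then falls back to a `PATH` lookup.

```python
import shutil
import sys
from pathlib import Path


def find_chromedriver(name="chromedriver"):
    """Return the path of a chromedriver binary, or None if none is found."""
    # Look in the interpreter's own directory first.
    exe_dir = Path(sys.executable).parent
    for candidate in (exe_dir / name, exe_dir / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    # Otherwise fall back to whatever is on PATH.
    return shutil.which(name)


print(find_chromedriver())
```

If this prints `None`, the driver is neither beside python.exe nor on `PATH`, and its full path has to be passed to selenium explicitly.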

After logging in, the page looks like this:

Request the page source and pull the token value out of it. The token is time-limited.
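
The token can be picked out of the page source with a regular expression. The sketch below uses the same `token=(\d+)` pattern as the full source further down, but returns `None` instead of raising when no token is present (for example when the login did not complete). The sample HTML fragment is invented for illustration; only the token value comes from the URLs in this article.

```python
import re


def extract_token(html):
    """Return the first numeric token found in the page source, or None."""
    match = re.search(r"token=(\d+)", html)
    return match.group(1) if match else None


# Hypothetical fragment of the logged-in page source
sample_html = '<a href="/cgi-bin/home?lang=zh_CN&token=1008822872">home</a>'
print(extract_token(sample_html))            # 1008822872
print(extract_token("<html>login</html>"))   # None
```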

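Click through to the place where official accounts can be searched.

Search for the name of the official account you want to scrape.

Right-click and open Inspect to get the fakeid value, which uniquely identifies the official account.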
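
Once the search response is available as JSON, the fakeid can also be read from the parsed data instead of from the raw text. The sketch below searches the parsed structure recursively for the first `fakeid` key, so it does not depend on the exact nesting. The sample payload (including the `list` and `nickname` fields) is an assumption made for illustration; only the fakeid value is taken from the URLs in this article.

```python
import json


def find_first(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


# Hypothetical search response; the real payload may be shaped differently
sample = json.loads('{"list": [{"fakeid": "MjM5MjAwODM4MA==", "nickname": "CSDN"}]}')
print(find_first(sample, "fakeid"))  # MjM5MjAwODM4MA==
```

Taking the first match mirrors what the full source does with `re.findall(...)[0]`: the first search result is assumed to be the account you meant.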

This article uses the CSDN official account as the example.

Pagination
Open the Headers tab and copy the request URL of the first page:

https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MjM5MjAwODM4MA==&type=9&query=&token=1008822872&lang=zh_CN&f=json&ajax=1

Then copy the URL of the second page:

https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=5&count=5&fakeid=MjM5MjAwODM4MA==&type=9&query=&token=1008822872&lang=zh_CN&f=json&ajax=1

Comparing the two shows that only the begin parameter changes: it increases by 5 per page, matching count=5.
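
The pagination rule can be sketched as a small URL builder: page `n` (counting from 0) uses `begin = n * count`. The function name `page_url` is hypothetical; the parameter names and values are the ones visible in the two URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://mp.weixin.qq.com/cgi-bin/appmsg"


def page_url(page, fakeid, token, count=5):
    """Build the article-list URL for a zero-based page index."""
    params = {
        "action": "list_ex",
        "begin": page * count,  # 0, 5, 10, ...
        "count": count,
        "fakeid": fakeid,
        "type": 9,
        "query": "",
        "token": token,
        "lang": "zh_CN",
        "f": "json",
        "ajax": 1,
    }
    return f"{BASE_URL}?{urlencode(params)}"


for page in range(3):
    print(page_url(page, "MjM5MjAwODM4MA==", "1008822872"))
```

Note that `urlencode` percent-encodes the trailing `==` of the fakeid as `%3D%3D`, which is the properly escaped form of the same value.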

The full source code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import tkinter as tk
from selenium import webdriver
import time, re, jsonpath, xlwt
from requests_html import HTMLSession
session = HTMLSession()


class GZHSpider(object):

    def __init__(self):
        """Create the main window and lay out its widgets."""
        self.window = tk.Tk()
        self.window.title('Official Account Scraper')
        self.window.geometry('800x600')

        # Label for the account-name field
        self.label_user = tk.Label(self.window, text='Official account to scrape:', font=('Arial', 12), width=30, height=2)
        self.label_user.pack()
        # Entry for the account name
        self.entry_user = tk.Entry(self.window, show=None, font=('Arial', 14))
        self.entry_user.pack(after=self.label_user)

        # Label for the page-count field
        self.label_passwd = tk.Label(self.window, text="Pages to scrape (fewer than 100):", font=('Arial', 12), width=30, height=2)
        self.label_passwd.pack()
        # Entry for the page count
        self.entry_passwd = tk.Entry(self.window, show=None, font=('Arial', 14))
        self.entry_passwd.pack(after=self.label_passwd)

        # Text box that shows the progress of the scrape
        self.text1 = tk.Text(self.window, font=('Arial', 12), width=85, height=22)
        self.text1.pack()

        # Button 1: start scraping

        self.button_1 = tk.Button(self.window, text='Scrape', font=('Arial', 12), width=10, height=1,
                                  command=self.parse_hit_click_1)
        self.button_1.pack(before=self.text1)

        # Button 2: clear the inputs and the text box
        self.button_2 = tk.Button(self.window, text='Clear', font=('Arial', 12), width=10, height=1,
                                  command=self.parse_hit_click_2)
        self.button_2.pack(anchor="e")


    def parse_hit_click_1(self):
        """Handler for button 1: read the inputs and call main()."""
        user_name = self.entry_user.get()
        pass_wd = int(self.entry_passwd.get())
        self.main(user_name, pass_wd)

    def main(self, user_name, pass_wd):
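        # Open the login page in Chrome
        driver_path = r'D:\python\chromedriver.exe'
        # executable_path is the Selenium 3 style; Selenium 4 expects service=Service(driver_path)
        driver = webdriver.Chrome(executable_path=driver_path)
        driver.get('https://mp.weixin.qq.com/')
        time.sleep(2)
        # Maximize the browser window
        driver.maximize_window()
        # Leave 20 seconds to scan the QR code with WeChat
        time.sleep(20)
        # Collect the cookies of the logged-in session
        cookies_list = driver.get_cookies()
        # Join them into a Cookie header string
        cookie = [item["name"] + "=" + item["value"] for item in cookies_list]
        cookie_str = '; '.join(cookie)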
        # Request headers
        headers_1 = {
            'cookie': cookie_str,
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/91.0.4472.77 Safari/537.36'
        }
        # Start URL
        start_url = 'https://mp.weixin.qq.com/'
        response = session.get(start_url, headers=headers_1).content.decode()
        # Extract the token; it is time-limited
        token = re.findall(r'token=(\d+)', response)[0]
        # Search for all accounts matching the name that was entered
        next_url = f'https://mp.weixin.qq.com/cgi-bin/searchbiz?action=search_biz&begin=0&count=5&query={user_name}&token=' \
                   f'{token}&lang=zh_CN&f=json&ajax=1'
        # Fetch the search response
        response_1 = session.get(next_url, headers=headers_1).content.decode()
        # Take the fakeid of the first match; it uniquely identifies the account
        fakeid = re.findall(r'"fakeid":"(.*?)",', response_1)[0]
        # Base URL and query parameters of the article-list request
        next_url_2 = 'https://mp.weixin.qq.com/cgi-bin/appmsg?'
        data = {
            'action': 'list_ex',
            'begin': '0',
            'count': '5',
            'fakeid': fakeid,
            'type': '9',
            'query': '',
            'token': token,
            'lang': 'zh_CN',
            'f': 'json',
            'ajax': '1'
        }
        headers_2 = {
            'cookie': cookie_str,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/91.0.4472.77 Safari/537.36',
            'referer': f'https://mp.weixin.qq.com/cgi-bin/appmsgtemplate?action=edit&lang=zh_CN&token={token}',
            'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
            'sec-ch-ua-mobile': '?0',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'x-requested-with': 'XMLHttpRequest'
        }

        # Create the workbook
        workbook = xlwt.Workbook(encoding='gbk', style_compression=0)
        sheet = workbook.add_sheet('test', cell_overwrite_ok=True)
        j = 1
        # Header row
        sheet.write(0, 0, 'Time')
        sheet.write(0, 1, 'Title')
        sheet.write(0, 2, 'URL')
        # Loop over the pages
        for i in range(pass_wd):
            data["begin"] = i * 5
            time.sleep(3)
            # Fetch the JSON response
            response_2 = session.get(next_url_2, params=data, headers=headers_2).json()

            # Use jsonpath to pull out the time, title and link of each article
            title_list = jsonpath.jsonpath(response_2, '$..title')
            url_list = jsonpath.jsonpath(response_2, '$..link')
            create_time_list = jsonpath.jsonpath(response_2, '$..create_time')

            # Convert timestamps to local time (Beijing time when the machine is set to UTC+8)
            list_1 = []
            for create_time in create_time_list:
                time_local = time.localtime(int(create_time))
                list_1.append(time.strftime("%Y-%m-%d %H:%M:%S", time_local))
            # Write one row per article
            for times, title, url in zip(list_1, title_list, url_list):
                # sheet.write(row, column, value)
                sheet.write(j, 0, times)
                sheet.write(j, 1, title)
                sheet.write(j, 2, url)
                j = j + 1
            # Report progress in the text box
            self.text1.insert("insert", f'*****************Page {i+1} scraped*****************')
            # Refresh the window so the message shows up while the loop is still running
            self.window.update()
            time.sleep(2)
            self.text1.insert("insert", '\n ')
            self.text1.insert("insert", '\n ')
        # Save the workbook
        workbook.save(f'{user_name}_official_account_info.xls')
        print(f"*********{user_name} official account info saved*********")


    def parse_hit_click_2(self):
        """Handler for button 2: clear both inputs and the text box."""
        self.entry_user.delete(0, "end")
        self.entry_passwd.delete(0, "end")
        self.text1.delete("1.0", "end")

    def center(self):
        """Center the window on the screen."""
        ws = self.window.winfo_screenwidth()
        hs = self.window.winfo_screenheight()
        x = int((ws / 2) - (800 / 2))
        y = int((hs / 2) - (600 / 2))
        self.window.geometry('{}x{}+{}+{}'.format(800, 600, x, y))

    def run_loop(self):
        """Lock the window size, center it, and start the event loop."""
        # Disallow resizing
        self.window.resizable(False, False)
        # Center the window
        self.center()
        # Keep the window alive
        self.window.mainloop()


if __name__ == '__main__':
    g = GZHSpider()
    g.run_loop()
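
One detail in `main()` is worth a note: `time.localtime` formats timestamps in whatever time zone the machine is set to, so the output is Beijing time only on a UTC+8 machine. The sketch below pins the zone explicitly and builds the spreadsheet rows article by article rather than zipping three separate jsonpath results. It assumes the article entries sit in a list under an `app_msg_list` key; that key name is an assumption about the response, while `title`, `link` and `create_time` are the fields the source reads. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # fixed UTC+8 offset


def to_beijing(ts):
    """Format a Unix timestamp as Beijing time, independent of the local zone."""
    return datetime.fromtimestamp(int(ts), tz=BEIJING).strftime("%Y-%m-%d %H:%M:%S")


def rows_from_page(page):
    """Turn one JSON page into (time, title, link) rows.

    Assumes the articles are listed under "app_msg_list" (an assumption).
    """
    return [
        (to_beijing(item["create_time"]), item["title"], item["link"])
        for item in page.get("app_msg_list", [])
    ]


# Hypothetical page with a single article
sample_page = {
    "app_msg_list": [
        {"create_time": 0, "title": "demo", "link": "https://example.com/a"}
    ]
}
print(rows_from_page(sample_page))
# [('1970-01-01 08:00:00', 'demo', 'https://example.com/a')]
```

Reading all three fields from the same item keeps each row consistent even if the response contains other `title` or `link` keys elsewhere.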

Once the code is written, package it up and it is ready to send to the client.
