Crawling WeChat Official Account Articles with Multithreaded Python

阿新 • Published: 2018-06-05

WeChat crawler, multithreaded crawler

This article adds multithreading to the crawler from the previous post (http://blog.51cto.com/superleedo/2124494).

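How it works:

1. Plan the execution flow: create two worker threads and one control thread.

2. Thread 1 fetches the article URLs and writes them to the urlqueue queue.

3. Thread 2 takes the URLs produced by thread 1, downloads the article content, and saves it to a local file.

4. Thread 3 controls the program and makes sure it exits only after threads 1 and 2 have finished.

5. For a clean exit, the worker threads are marked as daemon threads (daemon set to True), so they cannot keep the process alive once the control thread ends.

6. Add exception handling, and add delays between requests to avoid being blocked.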
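
The flow in the steps above can be sketched without any network access. This is only a minimal illustration, not the listing that follows: the names SENTINEL, producer, consumer and run_pipeline are hypothetical, the download is replaced by a stand-in string, and the main thread plays the control role by joining the workers instead of polling the queue.

```python
import queue
import threading
import time

SENTINEL = None  # hypothetical end-of-work marker, not used in the article's code


def producer(urlqueue, urls):
    """Thread 1: put each URL on the queue, then signal that no more will come."""
    for url in urls:
        urlqueue.put(url)
        time.sleep(0.01)  # stand-in for the throttling delay
    urlqueue.put(SENTINEL)


def consumer(urlqueue, results):
    """Thread 2: take URLs off the queue until the sentinel arrives."""
    while True:
        url = urlqueue.get()
        if url is SENTINEL:
            break
        try:
            results.append("fetched " + url)  # stand-in for download + save
        except Exception as e:
            print("exception " + str(e))


def run_pipeline(urls):
    urlqueue = queue.Queue()
    results = []
    t1 = threading.Thread(target=producer, args=(urlqueue, urls), daemon=True)
    t2 = threading.Thread(target=consumer, args=(urlqueue, results), daemon=True)
    t1.start()
    t2.start()
    # Control step: wait for both workers to finish before returning.
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["http://example.com/a", "http://example.com/b"]))
```

With a sentinel the consumer knows when to stop, so any work placed after its loop (such as closing a file) is guaranteed to run before the program exits.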

Enough talk, here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import urllib.request
import time
import sys
import urllib.error
import threading
import queue

urlqueue=queue.Queue()
## Install browser-like headers so requests look like they come from a browser
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
urllib.request.install_opener(opener)
## List used to store the links (one sub-list per result page)
listurl=[]

## Proxy helper function (disabled)
#def use_proxy(proxy_addr,url):
#       try:
#               import urllib.request
#               proxy=urllib.request.ProxyHandler({'http':proxy_addr})
#               opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
#               urllib.request.install_opener(opener)
#               data=urllib.request.urlopen(url).read().decode('utf-8')
#               data=str(data)
#               return data
#       except urllib.error.URLError as e:
#               if hasattr(e,"code"):
#                       print(e.code)
#               if hasattr(e,"reason"):
#                       print(e.reason)
#               time.sleep(10)
#       except Exception as e:
#               print("exception"+str(e))
#               time.sleep(1)

## Thread 1: collect all article links from the search result pages
class getlisturl(threading.Thread):
        def __init__(self,key,pagestart,pageend,urlqueue):
                threading.Thread.__init__(self)
                self.key=key
                self.pagestart=pagestart
                self.pageend=pageend
                self.urlqueue=urlqueue
        def run(self):
                # URL-encode the keyword; use self.key rather than the global variable
                keycode=urllib.request.quote(self.key)
                for page in range(self.pagestart,self.pageend+1):
                        url="http://weixin.sogou.com/weixin?type=2&query="+keycode+"&page="+str(page)
                        # the network request is the step that can fail, so it is the one wrapped in try
                        try:
                                data1=urllib.request.urlopen(url).read().decode('utf-8')
                                data1=str(data1)
                                listurlpat='<a data-z="art".*?(http://.*?)"'
                                listurl.append(re.compile(listurlpat,re.S).findall(data1))
                        except urllib.error.URLError as e:
                                if hasattr(e,"code"):
                                        print(e.code)
                                if hasattr(e,"reason"):
                                        print(e.reason)
                                time.sleep(10)
                        except Exception as e:
                                print("exception"+str(e))
                                time.sleep(1)
                        # pause between requests to avoid being blocked
                        time.sleep(2)
                print("Fetched "+str(len(listurl))+" pages")
                for i in range(0,len(listurl)):
                        time.sleep(6)
                        for j in range(0,len(listurl[i])):
                                url=listurl[i][j]
                                # the links are HTML-escaped: turn "&amp;" back into "&"
                                url=url.replace("amp;","")
                                print("Queueing link "+str(j)+" of page "+str(i))
                                # only the consumer should ever call task_done(), so it is not called here
                                self.urlqueue.put(url)

## Thread 2: download each article and save it
class getcontent(threading.Thread):
        def __init__(self,urlqueue):
                threading.Thread.__init__(self)
                self.urlqueue=urlqueue

        def run(self):
                # opening HTML written at the top of the local file
                html1 = '''
                                <!DOCTYPE html>
                                <html>
                                <head>
                                <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
                                <title>WeChat articles</title>
                                </head>
                                <body>
                                '''
                fh=open("/home/urllib/test/1.html","wb")
                fh.write(html1.encode("utf-8"))
                fh.close()
                # reopen the file in append mode to write each article
                fh=open("/home/urllib/test/1.html","ab")
                i=1
                while(True):
                        try:
                                # blocks until thread 1 has queued a URL
                                url=self.urlqueue.get()
                                data=urllib.request.urlopen(url).read().decode('utf-8')
                                data=str(data)
                                titlepat='var msg_title = "(.*?)";'
                                contentpat='id="js_content">(.*?)id="js_sg_bar"'
                                title=re.compile(titlepat).findall(data)
                                content=re.compile(contentpat,re.S).findall(data)
                                # defaults used when nothing matches
                                thistitle = "not found this time"
                                thiscontent= "not found this time"
                                # a non-empty list means a match was found; take its first element
                                if (title!=[]):
                                        thistitle = title[0]
                                if (content!=[]):
                                        thiscontent = content[0]
                                # combine title and content into one HTML fragment
                                dataall = "<p>Title: "+thistitle+"</p><p>Content: "+thiscontent+"</p><br>"
                                fh.write(dataall.encode('utf-8'))
                                # flush so the article is on disk even if the daemon thread is stopped
                                fh.flush()
                                print("Processed page "+str(i))
                                time.sleep(1)
                                i+=1
                        except urllib.error.URLError as e:
                                if hasattr(e,"code"):
                                        print(e.code)
                                if hasattr(e,"reason"):
                                        print(e.reason)
                                time.sleep(10)
                        except Exception as e:
                                print("exception"+str(e))
                                time.sleep(1)

                # NOTE: the loop above has no exit condition, so the closing tags
                # below are never written; browsers still render the file without them.
                fh.close()
                html2='''</body>
                </html>
                '''
                fh=open("/home/urllib/test/1.html","ab")
                fh.write(html2.encode("utf-8"))
                fh.close()

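## Thread 3: control thread, ends once the queue has been drained
class contrl(threading.Thread):
        def __init__(self,urlqueue):
                threading.Thread.__init__(self)
                self.urlqueue=urlqueue
        def run(self):
                while(True):
                        print("Program running.....")
                        time.sleep(60)
                        if(self.urlqueue.empty()):
                                print("Program finished...")
                                # leaving run() ends this thread; the daemon workers stop with it
                                break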

key="科技"   # search keyword ("technology")
#proxy="122.114.31.177:808"
pagestart=1
pageend=2
#listurl=getlisturl(key,pagestart,pageend)
#getcontent(listurl)
t1=getlisturl(key,pagestart,pageend,urlqueue)
# Mark the worker threads as daemon threads so the program can exit cleanly
t1.daemon=True
t1.start()
t2=getcontent(urlqueue)
t2.daemon=True
t2.start()
t3=contrl(urlqueue)
t3.start()
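
The three regular expressions in the listing can be tried offline before running the crawler. The sketch below copies the patterns from the code above; the function names extract_links and extract_article and both sample strings are made up for illustration, and html.unescape is swapped in as an alternative to the listing's replace("amp;","").

```python
import html
import re

# Patterns copied from the listing above.
LISTURLPAT = '<a data-z="art".*?(http://.*?)"'
TITLEPAT = 'var msg_title = "(.*?)";'
CONTENTPAT = 'id="js_content">(.*?)id="js_sg_bar"'

# Made-up fragments that only imitate the structure the patterns expect.
SAMPLE_LIST = '<a data-z="art" target="_blank" href="http://mp.weixin.qq.com/s?src=3&amp;ver=1">Sample</a>'
SAMPLE_ARTICLE = (
    'var msg_title = "Hello";\n'
    '<div id="js_content">\n<p>Body</p>\n</div><div id="js_sg_bar">'
)


def extract_links(page_html):
    """Return the article links of a result page with HTML entities decoded."""
    raw = re.compile(LISTURLPAT, re.S).findall(page_html)
    return [html.unescape(u) for u in raw]


def extract_article(article_html):
    """Return (title, content); each is None when its pattern does not match."""
    title = re.compile(TITLEPAT).findall(article_html)
    content = re.compile(CONTENTPAT, re.S).findall(article_html)
    return (title[0] if title else None, content[0] if content else None)


if __name__ == "__main__":
    print(extract_links(SAMPLE_LIST))
    print(extract_article(SAMPLE_ARTICLE))
```

Patterns like these depend on the exact markup of the pages, so a quick offline check of this kind is a cheap way to notice when the site layout has changed.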

The program runs as expected:

[image]

Open 1.html in a browser:

[image]

The code above can be used as is.
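
The listing handles a failed request by sleeping and moving on to the next URL. If retrying the same URL is preferred, one possible helper is sketched below. It is an assumption layered on top of the article, not part of it: the name fetch_with_retry and its parameters are hypothetical, and the download function is passed in so the helper can be exercised without network access.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, retries=3, delay=10, sleep=time.sleep):
    """Call fetch(url); on URLError wait `delay` seconds and try again.

    Returns the result of fetch, or None when every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            print("attempt " + str(attempt) + " failed: " + str(e.reason))
            if attempt < retries:
                sleep(delay)  # back off before the next attempt
    return None
```

In the crawler, fetch could be a small function wrapping urllib.request.urlopen(url).read().decode('utf-8'), and delay would play the role of the 10-second pause used in the listing.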
