Python crawler: wrapping my own commonly used helper methods
阿新 • Published: 2019-01-03
import urllib
import urllib.request
import ssl
import re
from collections import deque


def writeFile2Strs(url, topath):
    # Fetch the page and write it to a file as UTF-8 text.
    with open(topath, "w", encoding="utf-8") as f:
        f.write(getHtmlBytes(url).decode("utf-8"))


def writeFile2Bytes(url, topath):
    # Fetch the page and write the raw bytes to a file.
    with open(topath, "wb") as f:
        f.write(getHtmlBytes(url))


def getHtml_Str(url, decode="utf-8"):
    # Fetch the page and return it as a decoded string.
    return getHtmlBytes(url).decode(decode)


def getURL_list(strs):
    # Extract all URLs from a string with a regular expression.
    parUrl = r"(((http|ftp|https)://)(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,4})*(/[a-zA-Z0-9\&%_\./-~-]*)?)"
    re_URL = re.compile(parUrl)
    # Because the pattern has several capture groups, findall() returns a tuple
    # of groups for each match; deduplicate with set().
    listURL = list(set(re_URL.findall(strs)))
    listURLs = []
    for URLi in listURL:
        # Group [0] of each tuple is the full URL match.
        listURLs.append(URLi[0])
    return listURLs


def getQQ_list(strs):
    # Extract all QQ-number-like digit strings (5 to 11 digits, no leading zero).
    pat = r"[1-9]\d{4,10}"
    re_pat = re.compile(pat)
    listQQ = re_pat.findall(strs)
    listQQ = list(set(listQQ))
    return listQQ


def proceedAllUrlList(url, urlProceed):
    # Breadth-first crawl: process every URL reachable from the start URL.
    # A visited set prevents re-crawling the same page and looping forever.
    visited = set()
    dq = deque()
    dq.append(url)
    while len(dq) != 0:
        targeturl = dq.popleft()
        if targeturl in visited:
            continue
        visited.add(targeturl)
        urlList = getURL_list(getHtml_Str(targeturl))
        urlProceed(targeturl)
        for oneURL in urlList:
            dq.append(oneURL)


def getHtmlBytes(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
    req = urllib.request.Request(url, headers=headers)
    # Create an unverified SSL context so HTTPS pages with bad certificates still load.
    context = ssl._create_unverified_context()
    try:
        response = urllib.request.urlopen(req, timeout=5, context=context)
    except Exception:
        print("Request timed out; giving up on this URL")
        # Return empty bytes so callers that immediately call .decode() do not crash.
        return b""
    return response.read()
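To show how these helpers fit together, here is a minimal usage sketch. It is not part of the original post: the module name crawler_utils and the start URL are placeholders, and it assumes the code above has been saved as that module.

# Minimal usage sketch; crawler_utils and the start URL are placeholders.
from crawler_utils import (writeFile2Bytes, getHtml_Str, getURL_list,
                           getQQ_list, proceedAllUrlList)

start_url = "https://example.com"        # placeholder start URL

writeFile2Bytes(start_url, "page.html")  # save the raw page bytes to disk

html = getHtml_Str(start_url)            # fetch and decode the page as UTF-8
print(getURL_list(html))                 # every URL the regex matched
print(getQQ_list(html))                  # every QQ-number-like digit string

# Crawl outward from the start URL, printing each page as it is processed.
proceedAllUrlList(start_url, lambda u: print("processing", u))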