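Scraping a comic from an overseas website with Python, on a whim

阿新 • Published: 2020-09-23

The text and images in this article come from the internet and are for learning and discussion only, with no commercial use. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can deal with it.

Reposted from

https://blog.csdn.net/fei347795790?t=1

On a certain day of a certain month of a certain year, I reread a certain comic and, on a whim, decided to scrape it!!!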

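Approach:

1. Statically scrape the url links and chapter numbers for all chapters of the comic.

2. Dynamically scrape every page of each chapter.

3. Save each chapter to its own local folder.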

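I. Getting all the comic information

Inspecting the comic's index page, load = "http://x.seeym.net/xxxx/xxxxxxxx/", shows that the link and chapter number of every chapter sit inside a single tag. The requests and BeautifulSoup libraries fetch that tag's contents, which are then converted into a pd.DataFrame with three columns: ID (chapter number), url (chapter link), and name (chapter title).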

import os
import re
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


# Load the comic's index page and return a DataFrame with the columns ID, url, and name.
# Returns None if the page cannot be fetched or parsed.
def url(load):
    try:
        bb = requests.get(load, timeout=30)
        bb.encoding = bb.apparent_encoding
        soup = BeautifulSoup(bb.text, 'lxml')
        div = soup.find(id='mh-chapter-list-ol-0')
        urls = []
        names = []
        for i in div.find_all("a"):
            urls.append("http://m.seeym.net" + i.attrs["href"])
            # Keep only Chinese characters, digits, and ASCII letters in the chapter name.
            # get_text() is used because .string is None when the link has nested tags.
            b = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", i.get_text())
            names.append(b)
        # Links are numbered in reverse order of appearance: the first link gets the highest ID.
        info = pd.DataFrame({"ID": list(range(len(urls), 0, -1)), "url": urls, "name": names})
        info = info.sort_values(by="ID")
        return info
    except Exception as e:
        print("Page unavailable: {}".format(e))
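To try the parsing and numbering logic without touching the network, the chapter list can be read from a saved HTML snippet. The sketch below swaps in the standard library's html.parser for BeautifulSoup. Only the container id mh-chapter-list-ol-0 and the base address http://m.seeym.net come from the code above; the sample markup, its hrefs, and the names ChapterListParser and parse_chapters are made up for illustration. It also assumes the page lists the newest chapter first, which is what the reversed numbering suggests.

```python
from html.parser import HTMLParser

# Made-up markup that imitates the chapter list; the real page will differ.
SAMPLE_HTML = """
<ol id="mh-chapter-list-ol-0">
  <li><a href="/comic/3.html">Chapter 3</a></li>
  <li><a href="/comic/2.html">Chapter 2</a></li>
  <li><a href="/comic/1.html">Chapter 1</a></li>
</ol>
<a href="/other.html">Not a chapter</a>
"""


class ChapterListParser(HTMLParser):
    """Collect (href, text) pairs for links inside the element with a given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.list_tag = None  # tag name of the container while we are inside it
        self.depth = 0        # nesting depth of same-named tags inside the container
        self.href = None      # href of the <a> currently being read
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.list_tag is None:
            if attrs.get("id") == self.list_id:
                self.list_tag = tag
                self.depth = 1
            return
        if tag == self.list_tag:
            self.depth += 1
        if tag == "a" and "href" in attrs:
            self.href = attrs["href"]
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.list_tag is None:
            return
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None
        if tag == self.list_tag:
            self.depth -= 1
            if self.depth == 0:
                self.list_tag = None  # left the container


def parse_chapters(html, base="http://m.seeym.net", list_id="mh-chapter-list-ol-0"):
    """Return (ID, url, name) tuples sorted by ID, numbering the links in reverse."""
    parser = ChapterListParser(list_id)
    parser.feed(html)
    total = len(parser.links)
    rows = [(total - k, base + href, title)
            for k, (href, title) in enumerate(parser.links)]
    return sorted(rows)


for row in parse_chapters(SAMPLE_HTML):
    print(row)
```

The link outside the list is ignored, and the last link on the page ends up with ID 1.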

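II. Downloading the comic

URL analysis: the actual comic images are loaded dynamically, so selenium is used to read each image's address and the number of images in a chapter. Each page of a chapter has the URL http://xxxxxx.html?p=i, where i is the page number. Step 1: get the total number of images in the chapter. Step 2: loop over the pages and read each image's address. Step 3: download and save the images.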

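from selenium.webdriver.common.by import By


# Download one chapter. load: chapter URL, path: save directory, dic: chapter name,
# browser: Selenium WebDriver, ID: chapter number. Returns 0 on success, 1 on failure.
def jpg(load, path, dic, browser, ID):
    try:
        browser.get(load)
        # find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4.3 removed.
        elemt = browser.find_element(By.ID, "k_total")
        total = elemt.text
        for i in range(1, int(total) + 1):
            load1 = load + '?p=' + str(i)
            browser.get(load1)
            elemt = browser.find_element(By.ID, "qTcms_pic")
            bb1 = elemt.get_attribute('src')
            bb2 = requests.get(bb1, timeout=30)
            bb2.raise_for_status()  # do not save an error page as a .jpg
            path1 = path + '/' + str(ID) + '_' + dic + '/' + str(i) + '.jpg'
            with open(path1, 'wb') as f:
                f.write(bb2.content)
            # Pause 1 or 2 seconds between pages.
            time.sleep(np.random.randint(1, 3))
        print('xxxxxx{} downloaded'.format(dic))
        return 0
    except Exception as e:
        print('xxxxxx{} failed to download: {}'.format(dic, e))
        return 1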
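The ?p=i pattern is easy to get off by one, so here is a minimal sketch of just the URL construction, separated from the browser. The helper name page_urls is hypothetical, and the chapter address is the same masked placeholder used in the text. The total is accepted as a string because that is what an element's text gives back.

```python
def page_urls(chapter_url, total):
    """Return the URL of every page in a chapter: ?p=1 up to and including ?p=total."""
    return [chapter_url + "?p=" + str(i) for i in range(1, int(total) + 1)]


for link in page_urls("http://xxxxxx.html", "3"):
    print(link)
```

Pages are numbered from 1, and the last page number equals the total, which is why the range runs to total + 1.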

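Code change: some chapter names contained special characters that caused problems during download, so a regular expression now strips them out.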
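To see what that filter does to a chapter name, here is a small standalone sketch using the same character ranges as the re.sub call in the chapter-list code. The helper name clean_name and the sample titles are made up for illustration. Characters such as ? and : are not allowed in Windows file names, which is one way a raw title can break folder creation.

```python
import re

# Anything that is not a CJK ideograph (U+4E00 to U+9FA5), a digit, or an ASCII letter.
UNSAFE = re.compile(r"[^\u4e00-\u9fa50-9A-Za-z]")


def clean_name(raw):
    """Strip every character outside the allowed ranges from a chapter name."""
    return UNSAFE.sub("", raw)


print(clean_name("第12話：再會！?"))
print(clean_name("Vol. 3 / Extra"))
```

Note that spaces are removed as well, and two different titles can collapse to the same cleaned name, which is one reason the folder name also carries the chapter ID.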

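III. Creating and deleting the storage folders

When updating a comic, a chapter subfolder is created only if it does not already exist locally. Folders for failed downloads are deleted: if a chapter cannot be downloaded completely, the folder created for it is removed.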

# Create the chapter folder. Returns 0 if it was created, 1 if it already existed.
def mkdir(path, dic, ID):
    path = path + '/' + str(ID) + '_' + dic
    if not os.path.exists(path):
        print('Creating folder {}'.format(dic))
        os.mkdir(path)
        return 0
    return 1


# Delete the chapter folder and the files in it.
def deldir(path, dic, ID):
    path = path + '/' + str(ID) + '_' + dic
    if os.path.exists(path):
        for i in os.listdir(path):
            path1 = path + '/' + i
            os.remove(path1)
        os.rmdir(path)
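As an alternative sketch, not the original author's method, the same two operations can be written with pathlib and shutil.rmtree, which also removes nested subfolders. The names chapter_dir, ensure_dir, and remove_dir are hypothetical, and the return values are booleans rather than the 0/1 codes used above. The demo runs in a temporary directory so nothing is written to a real comic folder.

```python
import shutil
import tempfile
from pathlib import Path


def chapter_dir(root, chapter_id, name):
    """Build the folder path in the same <ID>_<name> layout used above."""
    return Path(root) / "{}_{}".format(chapter_id, name)


def ensure_dir(root, chapter_id, name):
    """Create the chapter folder; return True only if it was newly created."""
    target = chapter_dir(root, chapter_id, name)
    if target.exists():
        return False
    target.mkdir(parents=True)
    return True


def remove_dir(root, chapter_id, name):
    """Delete the chapter folder and everything in it; do nothing if it is absent."""
    shutil.rmtree(chapter_dir(root, chapter_id, name), ignore_errors=True)


with tempfile.TemporaryDirectory() as root:
    first = ensure_dir(root, 1, "demo")    # folder did not exist yet
    second = ensure_dir(root, 1, "demo")   # folder is already there
    (chapter_dir(root, 1, "demo") / "1.jpg").write_bytes(b"")
    remove_dir(root, 1, "demo")            # removes the folder and the file in it
    gone = not chapter_dir(root, 1, "demo").exists()

print(first, second, gone)
```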

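IV. Putting the code together

Download flow:

1. Set the load link and the local path, and start Chrome with webdriver.Chrome.

2. Get the comic's url information.

3. Loop over the chapters to download them, building a table, fail, of the chapters that failed.

4. For each chapter, create a subfolder and download the comic into it, calling time.sleep(np.random.randint(x)) to mimic the time a person spends reading.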

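def main():
    # Step 1: set the xxxxx URL and the local save path, and start Chrome.
    # executable_path is the Selenium 3 signature; Selenium 4 takes the driver path
    # through a Service object instead.
    browser = webdriver.Chrome(executable_path="D:/chromedriver.exe")
    load = "http://xxxxx.net/xxxx/xxxxx/"
    path = "F:/漫畫/xxxxx"
    # Step 2: get the url information for every chapter.
    info = url(load)
    if info is None:
        browser.quit()
        return None
    # Step 3: download the chapters, recording the row positions of those that fail.
    failed = []
    for i in range(len(info)):
        http = info.iloc[i, 1]
        dic = info.iloc[i, 2]
        ID = info.iloc[i, 0]
        check = mkdir(path, dic, ID)
        if check == 0:
            a = jpg(http, path, dic, browser, ID)
            if a == 1:
                failed.append(i)
                deldir(path, dic, ID)
            else:
                # Pause 3 to 5 seconds between chapters.
                time.sleep(np.random.randint(3, 6))
    browser.quit()
    # Only the chapters that failed, in the same columns as info.
    return info.iloc[failed, :]


fail = main()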
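The control flow in main() can be checked without a browser or network by passing the three operations in as functions. This is only a sketch under assumptions: download_all and its callback signatures are hypothetical and simpler than the mkdir, jpg, and deldir signatures above, though they follow the same 0/1 return codes.

```python
def download_all(chapters, make_dir, download, remove_dir):
    """Run the create, download, clean-up loop and return the chapters that failed.

    chapters: iterable of (chapter_id, url, name) tuples.
    make_dir(chapter_id, name) returns 0 when the folder was just created.
    download(url, chapter_id, name) returns 0 on success and 1 on failure.
    remove_dir(chapter_id, name) deletes the folder of a failed chapter.
    """
    failed = []
    for chapter_id, link, name in chapters:
        if make_dir(chapter_id, name) != 0:
            continue  # folder already exists, so the chapter is skipped
        if download(link, chapter_id, name) == 1:
            failed.append((chapter_id, link, name))
            remove_dir(chapter_id, name)  # drop the incomplete folder
    return failed


# Dry run with stand-ins: chapter 2 already exists locally, chapter 3 fails.
existing = {2}
removed = []
chapters = [(1, "u1", "one"), (2, "u2", "two"), (3, "u3", "three")]
result = download_all(
    chapters,
    make_dir=lambda cid, name: 1 if cid in existing else 0,
    download=lambda link, cid, name: 1 if cid == 3 else 0,
    remove_dir=lambda cid, name: removed.append(cid),
)
print(result)
print(removed)
```

Because a failed chapter's folder is deleted, running the loop again retries exactly the chapters that failed, while finished chapters are skipped by the folder check.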

V. Results

PS: the source is a pirate site, so the related links and other details in the text and images are hidden.
