Scraping JD and Taobao Product Detail Pages with Python (Working Around Anti-Scraping Measures)

Axin (阿新) • Published: 2022-01-10

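Below is the Python 3 code for scraping JD product detail pages. It crawls in batch, reading the product links from an Excel file. The Excel file has a header row, and each following row holds a product code in one column and its link in the next.

The code is as follows: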

import os
import time

import openpyxl
import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

# Browser-like request header. Without it, image downloads get blocked after a few dozen requests.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
IMG_DIR = os.path.join(os.getcwd(), 'img')
# JD stores only the relative image path, so the host prefix has to be added.
JD_IMG_BASE = 'https://img14.360buyimg.com/n0/'


def save_image(url, name):
    """Download url and save it as <name><ext> under IMG_DIR."""
    try:
        r = requests.get(url, headers=HEADERS, stream=True)
        if r.status_code == 200:
            # Take the extension from the URL path, ignoring any query string.
            ext = os.path.splitext(url.split('?')[0])[1]
            os.makedirs(IMG_DIR, exist_ok=True)
            with open(os.path.join(IMG_DIR, name + ext), 'wb') as f:
                f.write(r.content)  # write the image bytes to disk
    except Exception:
        print('Image access denied: ' + url)


class EgongYePing:

    # One shared Firefox instance, created once when the class is defined.
    # Preferences are set on the options object rather than a FirefoxProfile.
    options = webdriver.FirefoxOptions()
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip,application/octet-stream")
    driver = webdriver.Firefox(options=options)

    def Init(self, url, code):
        driver = self.driver
        url = url.strip()
        print(url)
        driver.get(url)
        # The browser works asynchronously. On a bad network the code may run
        # before the page has responded, so wait a fixed 3 seconds.
        time.sleep(3)
        html = etree.HTML(driver.page_source)
        if driver.title is not None:
            # Thumbnails: item 0 is the main image, the rest are extra images.
            for item, img in enumerate(html.xpath('//*[contains(@class,"spec-list")]//ul//li//img')):
                imgSrc = JD_IMG_BASE + img.attrib["data-url"]
                print('Downloading thumbnail: ' + imgSrc)
                label = "_main_" if item == 0 else "_extra_"
                save_image(imgSrc, code + label + str(item))

            # Detail images: try the three page layouts in turn until one yields nodes.
            listImg = html.xpath('//*[contains(@class,"ssd-module")]')
            if len(listImg) == 0:
                listImg = html.xpath('//*[contains(@id,"J-detail-content")]//div//div//p//img')
            if len(listImg) == 0:
                listImg = html.xpath('//*[contains(@id,"J-detail-content")]//img')
            for index, node in enumerate(listImg):
                attrs = node.attrib
                if 'data-id' in attrs:
                    # The image is the CSS background of the element with class
                    # "animate-<data-id>", so ask the browser for the computed value.
                    try:
                        element = driver.find_element(By.CLASS_NAME, "animate-" + attrs['data-id'])
                        details = element.value_of_css_property('background-image')
                        details = details.replace('url(', '').replace(')', '').replace('"', '').strip()
                        print('Downloading detail image: ' + details)
                        save_image(details, code + "_detail_" + str(index))
                    except Exception:
                        print('Other image formats are not collected')
                if 'src' in attrs:
                    details = attrs['src']
                    if 'http' not in details:
                        details = 'https:' + details  # protocol-relative URL
                    print('Downloading detail image: ' + details)
                    save_image(details, code + "_detail_" + str(index))
        print('Done')

    @staticmethod
    def readxlsx(filename):
        inwb = openpyxl.load_workbook(filename)  # open the workbook
        ws = inwb[inwb.sheetnames[0]]  # first sheet
        # Row 1 is the header. A cell holds the product code and the next cell its link.
        for r in range(2, ws.max_row + 1):
            for c in range(1, ws.max_column):
                code = ws.cell(r, c).value
                link = str(ws.cell(r, c + 1).value)
                if code is None:
                    continue
                if 'item.jd.com' in link and 'i-item.jd.com' not in link:
                    print('Supported: ' + str(code) + '|' + link)
                    EgongYePing().Init(link, str(code))
                else:
                    print('Unsupported format: ' + str(code) + '|' + link)


if __name__ == "__main__":
    start = EgongYePing()
    start.readxlsx(r'C:\Users\newYear\Desktop\爬圖.xlsx')

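Apart from expired products that can no longer be opened, all three JD page layouts are handled. For product pages that can be reached, both the page visit and the image download imitate a browser, so downloads are rarely blocked by anti-scraping measures.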

The driver setup at the top of the class runs the crawl in a real Firefox browser driven by Selenium.

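The User-Agent header passed to requests.get makes each download look like it comes from a browser. Without it, downloads often stop working after a few dozen images and stay blocked for a long time, because a request with no browser header is treated as a crawler.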
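As a minimal sketch of the header idea on its own, the snippet below attaches the same User-Agent string to a request. It uses the standard library's urllib instead of the requests package from the script, and the names BROWSER_UA, browser_request, fetch_bytes and the 10 second timeout are illustrative assumptions, not part of the original code.

```python
import urllib.request

# Same browser-like User-Agent string the script sends with requests.get.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) '
              'Gecko/20100101 Firefox/50.0')


def browser_request(url):
    """Build a request that carries a browser User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': BROWSER_UA})


def fetch_bytes(url, timeout=10):
    """Download url with the browser header and return the raw bytes."""
    with urllib.request.urlopen(browser_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling fetch_bytes with an image URL would return the bytes to write to disk. The header only changes how the request identifies itself, so spacing requests out may still be needed.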

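JD product detail pages commonly come in three layouts (and more may appear in the future).

So the detail section is parsed in three passes: whenever one pass finds no images, the next parsing method is tried. That covers every case seen so far.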
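The three-pass fallback can be sketched as a small helper that runs queries in order and stops at the first one that returns nodes. The XPath strings are the ones used in the script. The names first_match and JD_DETAIL_QUERIES are hypothetical, and run_query is assumed to be any callable that takes an XPath string and returns a list, such as the xpath method of an lxml tree.

```python
# The three JD detail-page layouts, in the order the script tries them.
JD_DETAIL_QUERIES = [
    '//*[contains(@class,"ssd-module")]',
    '//*[contains(@id,"J-detail-content")]//div//div//p//img',
    '//*[contains(@id,"J-detail-content")]//img',
]


def first_match(run_query, queries):
    """Return the nodes from the first query that finds anything, else []."""
    for query in queries:
        nodes = run_query(query)
        if nodes:
            return nodes
    return []
```

With the parsed page in a variable html, the call would be first_match(html.xpath, JD_DETAIL_QUERIES). Supporting a fourth layout then only means appending one more query to the list.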

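JD mostly stores only the relative image path (something like /1.jpg) in the page, while the host prefix is https://img14.360buyimg.com/n0/. For now the full URL has to be pieced together from the two.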
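A minimal sketch of that join is shown below. The prefix is the one named above. The function name jd_image_url and the sample path in the usage note are made up for illustration, and stripping a leading slash is an added safeguard rather than something the script does.

```python
JD_IMG_BASE = 'https://img14.360buyimg.com/n0/'


def jd_image_url(data_url):
    """Join the host prefix with the relative path read from data-url."""
    # Drop a leading slash so the result never contains '//' in the path.
    return JD_IMG_BASE + data_url.lstrip('/')
```

For a data-url value of jfs/t1/1.jpg this yields https://img14.360buyimg.com/n0/jfs/t1/1.jpg.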

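Another annoying thing about JD is that some images are attached as the CSS background of a div that is keyed by data-id, so getting them out takes a detour: look up the element by its data-id and read its background-image value. Fortunately that is solved too.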
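The browser reports a background image as a CSS value of the form url("..."). The script strips the wrapper with chained string replacements. The sketch below does the same job with a regular expression, which is a swapped-in technique, and the name css_background_url is hypothetical.

```python
import re

# Matches url(...), with or without quotes around the address.
_CSS_URL = re.compile(r'url\(\s*["\']?(.*?)["\']?\s*\)')


def css_background_url(value):
    """Extract the address from a CSS background-image value, or None."""
    match = _CSS_URL.search(value)
    if not match:
        return None  # e.g. the value is 'none'
    url = match.group(1)
    # Protocol-relative addresses need a scheme before they can be fetched.
    return 'https:' + url if url.startswith('//') else url
```

The input would be whatever value_of_css_property('background-image') returns for the element whose class is "animate-" plus the data-id.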

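Below is the Python 3 code for scraping Taobao product detail pages. It uses the same batch approach, with the links stored in an Excel file.

Taobao and JD were crawled together this time, so their links sit in the same Excel file and the code tells Taobao links apart from JD links. Here is the code: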

import os
import time

import openpyxl
import requests
from lxml import etree
from selenium import webdriver

# Browser-like request header. Without it, image downloads get blocked after a few dozen requests.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
IMG_DIR = os.path.join(os.getcwd(), 'img')


def save_image(url, name):
    """Download url and save it as <name><ext> under IMG_DIR."""
    try:
        r = requests.get(url, headers=HEADERS, stream=True)
        if r.status_code == 200:
            # Take the extension from the URL path, ignoring any query string.
            ext = os.path.splitext(url.split('?')[0])[1]
            os.makedirs(IMG_DIR, exist_ok=True)
            with open(os.path.join(IMG_DIR, name + ext), 'wb') as f:
                f.write(r.content)  # write the image bytes to disk
    except Exception:
        print('Image access denied: ' + url)


def with_scheme(url):
    """Add https: to protocol-relative URLs such as //img.alicdn.com/..."""
    return url if 'http' in url else 'https:' + url


class EgongYePing:

    # One shared Firefox instance, created once when the class is defined.
    # Preferences are set on the options object rather than a FirefoxProfile.
    options = webdriver.FirefoxOptions()
    options.set_preference("browser.download.folderList", 2)
    options.set_preference("browser.download.manager.showWhenStarting", False)
    options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip,application/octet-stream")
    driver = webdriver.Firefox(options=options)

    def Init(self, url, code):
        driver = self.driver
        url = url.strip()
        print(url)
        driver.get(url)
        # The browser works asynchronously. On a bad network the code may run
        # before the page has responded, so wait a fixed 3 seconds.
        time.sleep(3)
        html = etree.HTML(driver.page_source)
        if driver.title is not None:
            # Thumbnails: item 0 is the main image, the rest are extra images.
            for item, img in enumerate(html.xpath('//*[contains(@id,"J_UlThumb")]//img')):
                attrs = img.attrib
                if 'data-src' in attrs:
                    # Drop the 50x50 thumbnail suffix to get the full-size image.
                    imgSrc = attrs["data-src"].replace('.jpg_50x50', '')
                else:
                    imgSrc = attrs["src"]
                imgSrc = with_scheme(imgSrc)
                print('Downloading thumbnail: ' + imgSrc)
                label = "_main_" if item == 0 else "_extra_"
                save_image(imgSrc, code + label + str(item))

            # Detail images: lazy-loaded ones keep the URL in data-ks-lazyload.
            for index, img in enumerate(html.xpath('//*[contains(@id,"J_DivItemDesc")]//img')):
                attrs = img.attrib
                if 'data-ks-lazyload' in attrs:
                    details = attrs["data-ks-lazyload"]
                else:
                    details = attrs["src"]
                details = with_scheme(details)
                print('Downloading detail image: ' + details)
                save_image(details, code + "_detail_" + str(index))
        print('Done')

    @staticmethod
    def readxlsx(filename):
        inwb = openpyxl.load_workbook(filename)  # open the workbook
        ws = inwb[inwb.sheetnames[0]]  # first sheet
        # Row 1 is the header. A cell holds the product code and the next cell its link.
        for r in range(2, ws.max_row + 1):
            for c in range(1, ws.max_column):
                code = ws.cell(r, c).value
                link = str(ws.cell(r, c + 1).value)
                if code is None:
                    continue
                if 'item.taobao.com' in link:
                    print('Supported: ' + str(code) + '|' + link)
                    EgongYePing().Init(link, str(code))
                else:
                    print('Unsupported format: ' + str(code) + '|' + link)


if __name__ == "__main__":
    start = EgongYePing()
    start.readxlsx(r'C:\Users\newYear\Desktop\爬圖.xlsx')
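Because both sites' links share one Excel file, each script filters the rows by a substring test on the link. As a sketch, the two tests can be combined into one helper that names the site. The substring checks are taken from the two scripts, while the function name classify_link and its return values are assumptions for illustration.

```python
def classify_link(link):
    """Return 'jd', 'taobao', or None for a link read from the Excel file."""
    link = str(link)  # empty cells arrive as None
    # JD product pages live on item.jd.com; i-item.jd.com links are skipped.
    if 'item.jd.com' in link and 'i-item.jd.com' not in link:
        return 'jd'
    if 'item.taobao.com' in link:
        return 'taobao'
    return None
```

A single loop over the sheet could then send each row to the JD or the Taobao parser according to the result and report everything else as unsupported.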

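Taobao has two problems. The first is that pages can only be viewed after logging in with an account. Here I set a breakpoint in the code and went through the authorization by hand.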
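Where a debugger breakpoint is inconvenient, the same pause can be approximated by blocking on keyboard input until the login is done. This is only a sketch of an alternative, not what the author did. The name wait_for_manual_login is hypothetical, the login page address must be supplied by the caller, and driver is assumed to be a Selenium WebDriver or anything else with a get method.

```python
def wait_for_manual_login(driver, login_url, confirm=input):
    """Open the login page, then block until the user confirms they are in."""
    driver.get(login_url)
    # confirm defaults to input(), which waits for the Enter key.
    confirm('Log in in the browser window, then press Enter to continue...')
```

It would be called once before the crawl starts, so that the shared browser session is already authorized when the product pages are opened.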

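The second is the "take a break" throttling notice and lazy loading. The throttling notice has no real effect: by the time it appears the page structure has already loaded, and it does not block visits to other pages.

Lazy loading does not matter much either. If the image address is not written directly in src, check once more and take it from data-ks-lazyload.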
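The attribute check can be sketched as a helper that takes an image tag's attributes as a dictionary. The attribute names come from the script. The function name pick_image_url is hypothetical, and prefixing https: onto protocol-relative addresses is an assumption about how such addresses should be completed.

```python
def pick_image_url(attrs):
    """Prefer the lazy-load attribute, fall back to src, and add a scheme."""
    url = attrs.get('data-ks-lazyload') or attrs.get('src')
    if not url:
        return None  # the tag carries no usable address
    if url.startswith('//'):
        url = 'https:' + url
    return url
```

With lxml, the attrib mapping of each img element can be passed in directly.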

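Finally, here is a partial screenshot of the crawl results.

I recommend saving the scraped data straight to a server, a database, or an image server. The program is quite reliable: for ten thousand records it pulled down 26 GB of files, and uploading them afterwards nearly wore me out.

It really is large. In the end the files had to be split into a dozen or so 2 GB archives and uploaded one by one before it finally succeeded.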
