Scraping TT Purchase Records on 某東 (JD) to Analyze TT Buying Trends

阿新 • Published: 2019-01-10

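I have recently been learning some web-scraping techniques and wanted a small project to test what I had learned. While I was browsing JD (某東), the site suddenly recommended a TT (condom) product to me. After clicking through and browsing for a while, I had the idea of scraping TT products and analyzing the data to see which brand of TT sells best.

This article uses selenium to scrape the TT information and stores it in a mongodb database.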

Scraping TT product information

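The URL of the TT product list page is https://list.jd.com/list.html?cat=9192,9196,1502&page=1&sort=sort_totalsales15_desc&trans=1&JL=6_0_0#J_main.
It has a page parameter that gives the page number. Changing this parameter lets you crawl different pages of TT products.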
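As a minimal sketch of how the page parameter can be varied, the helper below rebuilds the list URL for a given page number. The name `build_list_url` is hypothetical; the query parameters are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://list.jd.com/list.html'


def build_list_url(page):
    """Return the list URL for the given page number (hypothetical helper)."""
    # Query parameters copied from the list URL above; only `page` changes.
    params = {
        'cat': '9192,9196,1502',
        'page': page,
        'sort': 'sort_totalsales15_desc',
        'trans': 1,
        'JL': '6_0_0',
    }
    # safe=',' keeps the commas in `cat` unescaped, as in the original URL.
    return BASE_URL + '?' + urlencode(params, safe=',') + '#J_main'


if __name__ == '__main__':
    for page in range(1, 4):
        print(build_list_url(page))
```

Each generated URL can then be loaded in turn to walk through the product list.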

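Use the browser's developer tools to see how to extract TT product information such as name, brand, price, and number of comments.
[Image: condom]
As the image above shows, each TT product corresponds to an li node with class gl-item, <li class='gl-item'>. Within the li node, the data-sku attribute is the product ID, which is needed later to fetch the product's comments, and brand_id is the brand ID. The div with class p-price holds the TT's price, and the div with class p-commit holds the total comment count.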

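At first I used requests, but I could never parse the TT price and comment information from the response, presumably because the page fills in those fields with JavaScript after it loads. Switching to selenium finally solved the problem. If anyone knows how to solve this with requests, I would be grateful for advice.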

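Next, scraping the TT product comments.

Click a TT product to go to its detail page, click "商品評論" (product reviews), then tick the "只看當前商品評價" (only show reviews for the current product) option. Without that option you see reviews for the whole product series. The product's comments are then displayed, and we again use the developer tools to see how to fetch them.
[Image: condom]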
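As the image above shows, the Network tab of the developer tools lists a request to "https://club.jd.com/discussion/getSkuProductPageImageCommentList.action?productId=3521615&isShadowSku=0&callback=jQuery6014001&page=2&pageSize=10&_=1547042223100", which returns JSON data wrapped in a call to the function named by the callback parameter. productId is the data-sku value from the TT product list page, and page is the page number of the comments. In the returned JSON, content is the comment text, and the time field (read as creationTime in the code below) is treated as the order time.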
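Because the request carries a callback parameter, the response body is JSON wrapped in a function call (JSONP), and the wrapper has to be removed before parsing. The sketch below does this with a regular expression instead of fixed slice offsets, so it does not depend on the length of the callback name. `strip_jsonp` is a hypothetical helper, and the sample string is made up rather than real site data.

```python
import json
import re

# Matches `callback( ... )` with an optional trailing semicolon.
JSONP_PATTERN = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response (hypothetical helper)."""
    match = JSONP_PATTERN.match(text)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))


if __name__ == '__main__':
    # Made-up sample in the shape described above.
    sample = 'jQuery6014001({"comments": [{"content": "ok"}]});'
    data = strip_jsonp(sample)
    print(data['comments'][0]['content'])
```

An empty or non-JSONP body raises `ValueError`, which makes a blocked or failed request easier to detect than a slicing error would.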

The code is as follows:


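import json
import random
import time

import pymongo
import requests
from pyquery import PyQuery as pq

# The original post does not show this setup; the definitions below are assumptions.
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed request headers
client = pymongo.MongoClient(host='localhost', port=27017)
collection = client.condomdb.condom_new  # product documents
collection_comment = client.condomdb.condom_comment  # assumed name for the comment collection


def get_random_time():
    # Assumed helper: a random delay in seconds between requests
    return random.uniform(1, 3)


def parse_product(page,html):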
    doc = pq(html)
    li_list = doc('.gl-item').items()
    for li in li_list:
        product_id = li('.gl-i-wrap').attr('data-sku')
        brand_id = li('.gl-i-wrap').attr('brand_id')
        time.sleep(get_random_time())
        title = li('.p-name').find('em').text()
        price_items = li('.p-price').find('.J_price').find('i').items()
        price = 0
        for price_item in price_items:
            price = price_item.text()
            break
        total_comment_num = li('.p-commit').find('strong a').text()
        # Large counts are shown as e.g. "1.2万+"; accept the traditional form 萬 as well
        if total_comment_num.endswith(("万+", "萬+")):
            print('Total comment count: ' + total_comment_num)
            total_comment_num = str(int(float(total_comment_num[:-2]) * 10000))
            print('Converted total comment count: ' + total_comment_num)
        elif total_comment_num.endswith("+"):
            total_comment_num = total_comment_num[:-1]
        condom = {}
        condom["product_id"] = product_id
        condom["brand_id"] = brand_id
        condom["condom_name"] = title
        condom["total_comment_num"] = total_comment_num
        condom["price"] = price
        comment_url = 'https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98vv117396&productId=%s&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'
        comment_url = comment_url %(product_id)
        response = requests.get(comment_url,headers = headers)
        if response.text == '':
            for i in range(0,10):
                time.sleep(get_random_time())
                try:
                    response = requests.get(comment_url, headers=headers)
                except requests.exceptions.ProxyError:
                    time.sleep(get_random_time())
                    response = requests.get(comment_url, headers=headers)
                if response.text:
                    break
                else:
                    continue
        text = response.text
        text = text[28:len(text) - 2]
        jsons = json.loads(text)
        productCommentSummary = jsons.get('productCommentSummary')
        # productCommentSummary = response.json().get('productCommentSummary')
        poor_count = productCommentSummary.get('poorCount')
        general_count = productCommentSummary.get('generalCount')
        good_count = productCommentSummary.get('goodCount')
        comment_count = productCommentSummary.get('commentCount')
        poor_rate = productCommentSummary.get('poorRate')
        good_rate = productCommentSummary.get('goodRate')
        general_rate = productCommentSummary.get('generalRate')
        default_good_count = productCommentSummary.get('defaultGoodCount')
        condom["poor_count"] = poor_count
        condom["general_count"] = general_count
        condom["good_count"] = good_count
        condom["comment_count"] = comment_count
        condom["poor_rate"] = poor_rate
        condom["good_rate"] = good_rate
        condom["general_rate"] = general_rate
        condom["default_good_count"] = default_good_count
        collection.insert_one(condom)

        comments = jsons.get('comments')
        if comments:
            for comment in comments:
                print('Parsing comment')
                condom_comment = {}
                reference_time = comment.get('referenceTime')
                content = comment.get('content')
                product_color = comment.get('productColor')
                user_client_show = comment.get('userClientShow')
                user_level_name = comment.get('userLevelName')
                is_mobile = comment.get('isMobile')
                creation_time = comment.get('creationTime')
                guid = comment.get("guid")
                condom_comment["reference_time"] = reference_time
                condom_comment["content"] = content
                condom_comment["product_color"] = product_color
                condom_comment["user_client_show"] = user_client_show
                condom_comment["user_level_name"] = user_level_name
                condom_comment["is_mobile"] = is_mobile
                condom_comment["creation_time"] = creation_time
                condom_comment["guid"] = guid
                collection_comment.insert_one(condom_comment)
        parse_comment(product_id)


def parse_comment(product_id):
    comment_url = 'https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98vv117396&productId=%s&score=0&sortType=5&page=%d&pageSize=10&isShadowSku=0&fold=1'
    for i in range(1,200):
        time.sleep(get_random_time())
        time.sleep(get_random_time())
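        print('Fetching comment page ' + str(i))
        url = comment_url%(product_id,i)
        response = requests.get(url, headers=headers,timeout=10)
        print(response.status_code)
        if response.text == '':
            # Retry the same page up to 10 times, pausing between attempts
            for _ in range(0,10):
                print('No data returned, retrying')
                time.sleep(get_random_time())
                response = requests.get(url, headers=headers, timeout=10)
                if response.text:
                    break
        if response.text == '':
            # Still empty after the retries: stop paging for this product
            break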
        text = response.text
        print(text)
        text = text[28:len(text) - 2]
        print(text)
        jsons = json.loads(text)
        comments = jsons.get('comments')
        if comments:
            for comment in comments:
                print('Parsing comment')
                condom_comment = {}
                reference_time = comment.get('referenceTime')
                content = comment.get('content')
                product_color = comment.get('productColor')
                user_client_show = comment.get('userClientShow')
                user_level_name = comment.get('userLevelName')
                is_mobile = comment.get('isMobile')
                creation_time = comment.get('creationTime')
                guid = comment.get("guid")
                id = comment.get("id")
                condom_comment["reference_time"] = reference_time
                condom_comment["content"] = content
                condom_comment["product_color"] = product_color
                condom_comment["user_client_show"] = user_client_show
                condom_comment["user_level_name"] = user_level_name
                condom_comment["is_mobile"] = is_mobile
                condom_comment["creation_time"] = creation_time
                condom_comment["guid"] = guid
                condom_comment["id"] = id
                collection_comment.insert_one(condom_comment)

        else:
            break
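
The count conversion inside parse_product can also be pulled out into a small function, which makes it easy to check on its own. The following is only a sketch: `normalize_comment_count` is a hypothetical name, it returns an int rather than a string, and it rounds instead of truncating so that floating-point error cannot shave one off the result.

```python
def normalize_comment_count(text):
    """Convert a displayed count such as '1.2万+' or '500+' to an int (hypothetical helper)."""
    text = text.strip()
    # Counts above ten thousand are displayed in units of 万 (10,000);
    # the traditional form 萬 is accepted as well.
    if text.endswith(('万+', '萬+')):
        return int(round(float(text[:-2]) * 10000))
    if text.endswith('+'):
        return int(text[:-1])
    return int(text)


if __name__ == '__main__':
    for raw in ['118万+', '1.2萬+', '500+', '42']:
        print(raw, normalize_comment_count(raw))
```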

If you would like the code for scraping the TT data and comments, follow my WeChat official account "python_ai_bigdata" and reply "TT" to get it.

In total I scraped 8,934 product records and about 170,000 comment (purchase) records.

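Brands with the most products

First, analyze the 8,934 products to see which brands have the most TT products on sale on JD. There are 299 brands selling TT on JD, too many to show, so only the top 10 brands are kept.
[Image: condom]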

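The chart above shows that Durex (杜蕾斯, nicknamed 杜杜) ranks first, Okamoto (岡本) second, and Jissbon (傑士邦, nicknamed 邦邦) third. The top 10 brands are 杜蕾斯, 岡本, 傑士邦, 倍力樂, 名流, 第六感, 尚牌, 赤尾, 諾絲, and 米奧. Five of them were new to me (倍力樂, 名流, 尚牌, 赤尾, and 米奧). I had seen the others, especially Durex and Jissbon, which occupy prominent spots at supermarket checkout counters all year round.

Of these 10 brands, Durex is from the UK and Okamoto is from Japan. 傑士邦, 第六感, 赤尾, 米奧, and 名流 are Chinese brands, with 第六感 being a condom brand owned by 傑士邦. 倍力樂 is a Sino-American joint-venture brand, 尚牌 is from Thailand, and 諾絲 is an American brand.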

Code:

import pymongo 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from pandas import DataFrame,Series

client = pymongo.MongoClient(host='localhost',port=27017) 
db = client.condomdb

condom_new = db.condom_new

cursor = condom_new.find() 
condom_df = pd.DataFrame(list(cursor)) 

brand_name_df = condom_df['brand_name'].to_frame()
brand_name_df['condom_num'] = 1

brand_name_group = brand_name_df.groupby('brand_name').sum()

brand_name_sort = brand_name_group.sort_values(by='condom_num', ascending=False)
brand_name_top10 = brand_name_sort.head(10)


# print(3 * np.random.rand(4))
index_list = []
labels = []
value_list = []

for index,row in brand_name_top10.iterrows():
    index_list.append(index)
    labels.append(index)
    value_list.append(int(row['condom_num']))

plt.rcParams['font.sans-serif']=['SimHei'] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly

series_condom = pd.Series(value_list, index=index_list, name='')
series_condom.plot.pie(labels=labels,
                 autopct='%.2f', fontsize=10, figsize=(10, 10))
plt.show()

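Best-selling products
The number of reviews can be used to judge how well a product sells: the products with the most reviews are usually also the best sellers.

Each product record has a total comment count field, so we sort by that field and look at the 10 products with the most comments, which we take to be the best sellers.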
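
The ranking step can be sketched as follows. The scraper stores total_comment_num as a string, so the sort key converts it to an integer first; sorting the raw strings would place '9000' ahead of '35000'. `top_products` is a hypothetical helper, and the sample records are made up.

```python
def top_products(products, n=10):
    """Return the n products with the largest total_comment_num (hypothetical helper)."""
    # Convert to int so the comparison is numeric rather than lexicographic.
    return sorted(products, key=lambda p: int(p['total_comment_num']), reverse=True)[:n]


if __name__ == '__main__':
    # Made-up records using the field names stored by the scraper.
    sample = [
        {'condom_name': 'A', 'total_comment_num': '1200'},
        {'condom_name': 'B', 'total_comment_num': '980000'},
        {'condom_name': 'C', 'total_comment_num': '35000'},
        {'condom_name': 'D', 'total_comment_num': '9000'},
    ]
    for product in top_products(sample, n=3):
        print(product['condom_name'], product['total_comment_num'])
```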
[Image: condom]

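The chart above shows that Durex products still sell best, taking 6 of the 10 places. Durex's 情愛四合一 ranks first with about 1,180,000 (118萬) comments, used here as a proxy for sales.

Ultra-thin TT are the most popular, taking 8 places, and long-lasting types are also fairly popular. A spiked sleeve (狼牙套) even made the list, which really surprised me.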

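Sales analysis
The chart below shows the 10 days with the highest TT sales.
[Image: condom]
These 10 days fall in June, November, and December, which is presumably related to the familiar 618, Double 11 (雙11), and Double 12 (雙12) shopping festivals.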

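Many e-commerce platforms now have their own shopping festivals, such as 618, Double 11, and Double 12. A product shows at most 100 pages of comments at 10 comments per page, so at most 1,000 comments can be crawled per product. For 情愛四合一, with sales of about 1.18 million, 1,000 comments are not representative. Overall, though, the chart above suggests that sales are generally better in the months when e-commerce platforms run promotions.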

The bar chart below shows TT sales by month and further supports this.
[Image: condom]
November has the highest sales, December is second, and June is third.
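
The per-day and per-month tallies behind these charts can be approximated by counting comment timestamps. This is a sketch only: it assumes the timestamps are strings of the form 'YYYY-MM-DD HH:MM:SS', the function name is hypothetical, and the sample values are made up.

```python
from collections import Counter


def count_by_day_and_month(timestamps):
    """Count timestamps per day and per calendar month (hypothetical helper)."""
    by_day = Counter(ts[:10] for ts in timestamps)    # 'YYYY-MM-DD'
    by_month = Counter(ts[5:7] for ts in timestamps)  # 'MM', pooled across years
    return by_day, by_month


if __name__ == '__main__':
    # Made-up timestamps in the assumed format.
    sample = [
        '2018-11-11 00:10:00',
        '2018-11-11 09:00:00',
        '2018-06-18 12:00:00',
        '2017-11-11 08:00:00',
    ]
    by_day, by_month = count_by_day_and_month(sample)
    print(by_day.most_common(10))  # the busiest days
    print(sorted(by_month.items()))
```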

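Shopping platform
[Image: condom]
Most TT purchases are made through the JD app: 91% of users come from the JD Android and iPhone clients, while 6% come from the PC site. This is probably related to the growth of 4G in recent years.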
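
Percentages like these can be derived from the user_client_show field stored with each comment. The sketch below shows one way to do it; `client_shares` is a hypothetical helper, and the client labels in the sample are made up rather than the site's actual values.

```python
from collections import Counter


def client_shares(clients):
    """Return each client's share of comments in percent, largest first (hypothetical helper)."""
    counts = Counter(clients)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, round(100 * n / total, 2)) for name, n in counts.most_common()]


if __name__ == '__main__':
    # Made-up labels standing in for the user_client_show values.
    sample = ['android'] * 5 + ['iphone'] * 4 + ['pc']
    for name, share in client_shares(sample):
        print(name, share)
```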

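From the analysis above, ultra-thin TT are the most popular and Durex products sell best. The latter has a lot to do with Durex's marketing: its ad copy can be called textbook-level, every release sparks discussion, and each piece is a classic. Buying TT from mobile clients has become mainstream, accounting for more than 90% of traffic.

Here are a few classic pieces of Durex ad copy.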

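A heartfelt Double 11 piece:

Didi Chuxing (滴滴出行) announces its acquisition of Uber China.
Durex: DUDU打車, the choice of experienced drivers (老司機).

[Image: condom]

From when Honor of Kings (王者榮耀) was at its peak: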
