Crawler Learning 18: Scraping Taobao product listings with Selenium and headless Chrome (asynchronously loaded pages)
By 阿新 • Published: 2019-02-01
Open Taobao and inspect the page structure with F12: Taobao is also an asynchronously loaded site, and reverse-engineering its background requests is not always easy. Here we use Selenium with headless Chrome instead. Many online examples combine Selenium with PhantomJS, but recent Selenium releases have dropped PhantomJS support, so headless Chrome is used here; the approach is much the same. Because Selenium drives a real browser, asynchronously loaded sites like this one are easier to crawl. This experiment simulates logging in to Taobao from a browser and searching for lipstick (口紅); for each product it extracts the name, link, shop, price, number of buyers, and location, and stores the results in MongoDB. The code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree
import time
import pymongo

# Connect to a local MongoDB instance and select the target collection.
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
taobao = mydb['taobao']

# driver = webdriver.PhantomJS()  # no longer supported by recent Selenium
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(executable_path='E:/Pyhton_lib/chromedriver_win32/chromedriver.exe',
                          chrome_options=chrome_options)
driver.maximize_window()

def get_info(url, page):
    """Parse one page of search results and store each item in MongoDB."""
    page = page + 1
    driver.get(url)
    driver.implicitly_wait(10)
    selector = etree.HTML(driver.page_source)
    infos = selector.xpath('//div[@class="item J_MouserOnverReq "]')
    for info in infos:
        goods_url = info.xpath('div[1]/div/div/a/@href')[0]
        goods = info.xpath('div[1]/div/div/a/img/@alt')[0]
        print("goods:%s" % goods)
        price = info.xpath('div[2]/div/div/strong/text()')[0]
        sell = info.xpath('div[2]/div/div[@class="deal-cnt"]/text()')[0]
        shop = info.xpath('div[2]/div[3]/div[1]/a/span[2]/text()')[0]
        address = info.xpath('div[2]/div[3]/div[2]/text()')[0]
        commodity = {
            'goods_url': goods_url,
            'goods': goods,
            'price': price,
            'sell': sell,
            'shop': shop,
            'address': address,
        }
        taobao.insert_one(commodity)
    if page <= 100:
        NextPage(url, page)

def NextPage(url, page):
    """Click the next-page link, then parse the page it leads to."""
    driver.get(url)
    driver.implicitly_wait(10)
    driver.find_element_by_xpath('//a[@trace="srp_bottom_pagedown"]').click()
    time.sleep(4)
    driver.get(driver.current_url)
    driver.implicitly_wait(10)
    get_info(driver.current_url, page)

if __name__ == "__main__":
    page = 1
    url = 'https://www.taobao.com'
    driver.get(url)
    driver.implicitly_wait(10)
    # Type the search keyword into the search box and submit.
    driver.find_element_by_id('q').clear()
    driver.find_element_by_id('q').send_keys('口紅')
    driver.find_element_by_class_name('btn-search').click()
    get_info(driver.current_url, page)
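The XPath extraction logic can be exercised offline, without a browser, against a hand-written HTML fragment. The fragment below is a simplified, hypothetical stand-in for one Taobao result item (the real markup is more complex and changes over time); it only mirrors the nesting that the XPath expressions above assume:

```python
from lxml import etree

# Hypothetical, simplified markup for one result item; the class name and
# nesting mirror what the crawler's XPath expressions expect, not the live site.
sample = '''
<div class="item J_MouserOnverReq ">
  <div><div><div><a href="https://item.example.com/123"><img alt="sample lipstick"/></a></div></div></div>
  <div>
    <div>
      <div><strong>59.00</strong></div>
      <div class="deal-cnt">1200 paid</div>
    </div>
    <div></div>
    <div>
      <div><a><span>icon</span><span>Sample Shop</span></a></div>
      <div>Shanghai</div>
    </div>
  </div>
</div>
'''

selector = etree.HTML(sample)
info = selector.xpath('//div[@class="item J_MouserOnverReq "]')[0]
# Same relative XPath expressions as in the crawler above:
commodity = {
    'goods_url': info.xpath('div[1]/div/div/a/@href')[0],
    'goods': info.xpath('div[1]/div/div/a/img/@alt')[0],
    'price': info.xpath('div[2]/div/div/strong/text()')[0],
    'sell': info.xpath('div[2]/div/div[@class="deal-cnt"]/text()')[0],
    'shop': info.xpath('div[2]/div[3]/div[1]/a/span[2]/text()')[0],
    'address': info.xpath('div[2]/div[3]/div[2]/text()')[0],
}
print(commodity)
```

Testing the selectors this way makes it easy to tell whether a failed run is caused by the page markup changing or by the browser automation itself.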
A partial screenshot of the results is shown below:
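One design note on the script above: it pages through results with mutual recursion (get_info calls NextPage, which calls get_info again), building a call chain up to 100 frames deep. A flat loop does the same work and is easier to reason about. The sketch below uses hypothetical `parse_page` / `click_next_page` callables standing in for the Selenium calls; the stub functions only simulate a three-page result set:

```python
def crawl(parse_page, click_next_page, max_pages=100):
    """Parse the current page, then advance, for at most max_pages pages."""
    pages_done = 0
    for _ in range(max_pages):
        parse_page()               # extract items and store them (e.g. in MongoDB)
        pages_done += 1
        if not click_next_page():  # False once there is no next page
            break
    return pages_done

# Quick self-check with stubs that simulate a 3-page result set:
state = {'page': 1}

def fake_parse():
    pass  # a real implementation would run the XPath extraction here

def fake_next():
    if state['page'] < 3:
        state['page'] += 1
        return True
    return False

visited = crawl(fake_parse, fake_next)
print(visited)  # 3 pages visited
```

Besides avoiding the deep call chain, this shape makes the stopping condition explicit: the loop ends either at the page cap or when the next-page link is gone, whichever comes first.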