Python3 [Crawler in Practice] Using scrapy to crawl every link on Autohome (汽車之家) and save them to a JSON file

阿新 • Published: 2019-01-18

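Last night I happened to be studying the blog of Cui Qingcai (崔慶才) and decided to try crawling the entire contents of a website. The Fuliba (福利吧) site can no longer be found, so I wandered over to Autohome instead (http://www.autohome.com.cn/beijing/).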

I really like this site. Even women like cars, let alone men. (facepalm)

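The approach:
1. Use the CrawlSpider spider class.
2. Use Rule.
Together, these two let the spider crawl an entire site.

3. Use LinkExtractor together with Rule to match URLs against a pattern.
4. FormRequest: a Request subclass in scrapy for submitting forms, typically used for logging in. (It is imported in the spider below but not actually used in this crawl.)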
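To make the idea of "Rule plus LinkExtractor crawls the whole site" concrete, here is a toy sketch in plain Python. It is not how scrapy is implemented (scrapy is asynchronous and its scheduler handles de-duplication); the in-memory SITE dictionary and the example.com URLs are made up for illustration. It only shows the logic: extract links, keep those matching the allow pattern, hand each one to a callback, and keep following.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "http://example.com/": ["http://example.com/a.html", "http://example.com/about"],
    "http://example.com/a.html": ["http://example.com/b.html", "http://example.com/"],
    "http://example.com/b.html": ["http://example.com/a.html"],
    "http://example.com/about": ["http://example.com/hidden.html"],
}

# Plays the role of LinkExtractor(allow=(r'\.html',))
ALLOW = re.compile(r"\.html")


def crawl(start_url):
    """Breadth-first crawl that only extracts and follows links matching ALLOW."""
    seen = {start_url}
    queue = deque([start_url])
    matched = []  # URLs whose pages would be handed to the callback
    while queue:
        url = queue.popleft()
        for link in SITE.get(url, []):
            if link in seen or not ALLOW.search(link):
                continue  # skip duplicates and links the rule does not allow
            seen.add(link)
            matched.append(link)  # callback='parse_item' would run for this page
            queue.append(link)    # follow=True: keep extracting links from it
    return matched


print(crawl("http://example.com/"))
```

Note that hidden.html is never reached in this toy site: the only page linking to it is /about, which the single rule does not allow and therefore never follows.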

Note: the full-site crawl here simply prints every URL that contains .html and saves it to a JSON file.
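A small sketch of what the pattern \.html actually accepts, using only the re module. The allow patterns of LinkExtractor are regular expressions searched against the full URL, so the pattern matches ".html" anywhere in the URL, not only at the end. The URLs below are made-up examples, and the stricter pattern is my own variation, not something the spider in this post uses.

```python
import re

# Hypothetical URLs, for illustration only
urls = [
    "http://www.autohome.com.cn/news/1.html",
    "http://www.autohome.com.cn/news/1.html?page=2",
    "http://www.autohome.com.cn/beijing/",
    "http://www.autohome.com.cn/htmlguide",
]

loose = re.compile(r"\.html")    # the pattern used in this post
strict = re.compile(r"\.html$")  # variation: only URLs that end in .html

# search() looks for the pattern anywhere in the string
kept_loose = [u for u in urls if loose.search(u)]
kept_strict = [u for u in urls if strict.search(u)]

print(kept_loose)
print(kept_strict)
```

The loose pattern keeps the first two URLs (including the one with a query string); the strict one keeps only the first.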

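We could go further from here and extract the content behind each URL.

The idea for extraction: open any one of the URLs, find the content you want, and pull it out with XPath.

The usual way to get an XPath: select the line containing the content you want in the browser's developer tools, then right-click Copy --> Copy XPath. Experienced users recommend taking the XPath from Chrome; Firefox may sometimes fail to pick up the element you want.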

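Simple, commonly used XPath rules:

//*[@id="post_content"]/p[1]

This means: anywhere in the document, find the element whose id is post_content and take the first p element directly inside it (p[1]; XPath positions start at 1).

If you need the text of that element, append something so that it becomes:

//*[@id="post_content"]/p[1]/text()

Appending text() selects the element's text.

To extract an attribute of the element, replace text() with @attribute, for example:

//*[@id="post_content"]/p[1]/@src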

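So easy! That's XPath extraction done. Now for how to use it, which is even simpler:

response.xpath('the XPath you copied').extract()[index of the value you want]

Note that extract() returns a list, so you index into it (starting from 0) to get a single value.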
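To try out the path logic above without running a spider, here is a minimal sketch using the standard library's xml.etree.ElementTree. This is a stand-in, not scrapy's response.xpath: ElementTree supports only a subset of XPath, paths must start with "." and the final text() and @src steps are not available, so the text is read from the element's .text attribute instead. The PAGE markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, written as well-formed XML so ElementTree can parse it
PAGE = """
<html><body>
  <div id="post_content">
    <p>First paragraph</p>
    <p><img src="a.jpg"/></p>
    <p><img src="b.jpg"/></p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Same shape as //*[@id="post_content"]/p[1]
nodes = root.findall(".//*[@id='post_content']/p[1]")

# Like extract(), findall() gives back a list, even for a single match
texts = [p.text for p in nodes]
first = texts[0]

print(texts)
print(first)
```

In a real spider the equivalent would be the response.xpath(...).extract() call shown above, with text() appended to the path.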

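Those are the basic extraction rules. Pretty easy to follow, I think, and much easier than what I studied before. Maybe that's because I'm still a beginner. Hahaha.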

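Appendix:

About the XPath for imgurl:

Pick any image on the page and Copy XPath; you'll get something like:

//*[@id="post_content"]/p[2]/img

Looking at the page, you'll see that each post has several images, and each image's address is in the src attribute of an img element under a p element.

Remove the 2 so that it becomes:

//*[@id="post_content"]/p/img

Now it matches the img elements under all of the p elements. Append /@src and you get the addresses of all the images. (Don't index the result with [0], because we want all the addresses; with it you would only get one.)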
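The difference between p[2]/img and p/img can be checked with a small standard-library sketch. As before, xml.etree.ElementTree is only a stand-in for scrapy's selectors (it has no /@src step, so the attribute is read with .get("src")), and the PAGE markup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical post body with one image per paragraph
PAGE = """
<html><body>
  <div id="post_content">
    <p>Some text</p>
    <p><img src="a.jpg"/></p>
    <p><img src="b.jpg"/></p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# With the position kept: only the img inside the second p
one = [img.get("src") for img in root.findall(".//*[@id='post_content']/p[2]/img")]

# With the position removed: the img elements under every p
all_srcs = [img.get("src") for img in root.findall(".//*[@id='post_content']/p/img")]

print(one)
print(all_srcs)
```

Indexing all_srcs with [0] would again give just one address, which is why the index is left off when every image is wanted.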

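For more on XPath usage and features, I recommend reading w3cschool.

Evidently I haven't read much of w3c myself. I should find the time to go through it; it's the fundamentals, after all.

That's about all the rambling. I really am a chatterbox, it seems.

Here's the code; the comments in it are quite detailed.

Step 1:

The file in the spiders directory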

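# -*- coding: utf-8 -*-
# @Time    : 2017/8/27 0:43
# @Author  : 蛇崽
# @Email   : [email protected]  (mainly practice for full-site crawling)
# @File    : LongXunDaoHangSpider.py

# CrawlSpider and Rule work together to traverse the whole site
from scrapy.spiders import CrawlSpider, Rule

# LinkExtractor is used with Rule to match URL patterns
from scrapy.linkextractors import LinkExtractor

# Request is the basic request class; FormRequest is its subclass for
# submitting forms (e.g. logging in). Neither is used in this spider.
from scrapy import Request, FormRequest
from allNet.items import LongXunDaoHang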

class longxunDaoHang(CrawlSpider):

    name = 'longxun'

    allowed_domains = ['autohome.com.cn']

    start_urls = ['http://www.autohome.com.cn/shanghai/']

    # Extract and follow every link whose URL contains ".html",
    # passing each fetched page to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r'\.html',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Record the URL of each matched page as an item
        print(response.url)
        daohang = LongXunDaoHang()
        daohang['categoryLink'] = response.url
        yield daohang

Step 2:

Contents of settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for allNet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'allNet'

SPIDER_MODULES = ['allNet.spiders']
NEWSPIDER_MODULE = 'allNet.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allNet (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'allNet.middlewares.AllnetSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'allNet.middlewares.MyCustomDownloaderMiddleware': 543,
#     'allNet.middleware.JsonWritePipline':300,
# }

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allNet.pipelines.AllnetPipeline': 300,
    'allNet.pipelines.JsonWritePipline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Step 3:

Contents of pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class AllnetPipeline(object):
    def process_item(self, item, spider):
        return item
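

# Write each item to a JSON file, one object per line
class JsonWritePipline(object):
    def __init__(self):
        self.file = open('汽車之家全站url.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider on a pipeline when the spider finishes;
    # a method named spider_closed is not called automatically
    def close_spider(self, spider):
        self.file.close()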
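One thing worth knowing about the output: because the pipeline writes one JSON object per line, the resulting file is in "JSON Lines" style rather than a single JSON document, so json.load on the whole file would fail once there is more than one item. A minimal sketch of writing and reading that format with the standard library follows; the sample items and the temporary file name are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical items, shaped like the ones the spider yields
items = [
    {"categoryLink": "http://www.autohome.com.cn/news/1.html"},
    {"categoryLink": "http://www.autohome.com.cn/news/2.html"},
]

path = os.path.join(tempfile.mkdtemp(), "urls.json")

# Write: one JSON object per line, as process_item does
with open(path, "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read back: parse each non-empty line on its own
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```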

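Oddly, I didn't configure cookies or anything else for Autohome, yet the full-site crawl never reported an error. I wrote the code around midnight last night, figuring scrapy would finish in no time, but it was still crawling this morning. I'm speechless. The computer went to sleep several times during the night and I woke it up again each time. The strange thing now is that the spider is still running, but nothing more is being written to the JSON file. Quite odd. Finally, a screenshot of the crawl results: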

[Image: screenshot of the crawl results]
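Since the crawl ran all night without finishing, one option would be to bound it in settings.py. The fragment below is only a sketch: the setting names are standard scrapy settings, but the values are arbitrary assumptions of mine and were not part of the original project or tested against Autohome.

```python
# Possible additions to settings.py (values are illustrative assumptions)

# Do not follow links more than 3 hops away from the start URL
DEPTH_LIMIT = 3

# Close the spider after this many responses have been crawled
CLOSESPIDER_PAGECOUNT = 1000

# Wait between requests to the same site, in seconds
DOWNLOAD_DELAY = 1

# Let scrapy adjust the delay based on how the server responds
AUTOTHROTTLE_ENABLED = True
```

With limits like these the spider stops on its own, at the cost of no longer covering the whole site.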

I'm leaving myself some homework: extract and save data from the crawled URLs. Put simply: save the content behind each URL. (facepalm.jpg)
