
A CrawlSpider Example with the Scrapy Framework

In a CrawlSpider subclass:

- rules defines the URL-extraction rules; it is a tuple, and the rules are applied in order.

- LinkExtractor is the link extractor that pulls URLs out of each response.

- callback: the responses of the extracted URLs are handed to this callback for processing.

- follow: whether the response of the current URL is itself run through rules again to extract further URLs.
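The interplay of the allow pattern, callback, and follow can be sketched with a plain-regex model (a hypothetical simplification of what CrawlSpider does internally; the real LinkExtractor parses HTML and resolves relative URLs properly):

```python
import re

# Hypothetical minimal model of CrawlSpider rule handling:
# each rule is (url_pattern, callback_name, follow_flag).
RULES = (
    (r'/info\d+\.htm', 'parse_item', False),  # detail pages -> callback
    (r'/page\d+\.htm', None, True),           # pagination pages -> follow again
)

def classify(urls):
    """Return (urls sent to a callback, urls scheduled for re-extraction)."""
    to_callback, to_follow = [], []
    for url in urls:
        for pattern, callback, follow in RULES:
            if re.search(pattern, url):
                if callback:
                    to_callback.append(url)
                if follow:
                    to_follow.append(url)
                break  # first matching rule wins, as in CrawlSpider
    return to_callback, to_follow

cb, fl = classify(['/web/site0/tab5240/info123.htm',
                   '/web/site0/tab5240/module14430/page2.htm'])
print(cb)  # ['/web/site0/tab5240/info123.htm']
print(fl)  # ['/web/site0/tab5240/module14430/page2.htm']
```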

The concrete implementation in cf.py (simplified):

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CfSpider(CrawlSpider):
    name = 'cf'
    allowed_domains = ['bxjg.circ.gov.cn']
    start_urls = ['http://bxjg.circ.gov.cn/web/site0/tab5240/Default.htm']

    rules = (
        # Detail pages: each extracted URL's response goes to parse_item
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+\.htm'), callback='parse_item'),
        # Pagination pages: follow=True runs their responses through rules again
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/module14430/page\d+\.htm'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = re.findall(r"<!--TitleStart-->(.*?)<!--TitleEnd-->", response.body.decode())[0]
        item['publish_date'] = re.findall(r"釋出時間:(20\d{2}-\d{2}-\d{2})", response.body.decode())[0]
        print(item)
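The two re.findall patterns used in parse_item can be checked against a small HTML fragment (the markup below is made up here to mirror the structure of a detail page, not copied from the real site):

```python
import re

# Hypothetical fragment mimicking a detail page's body
html = (
    "<html><body>"
    "<!--TitleStart-->某公告標題<!--TitleEnd-->"
    "<span>釋出時間:2019-05-20</span>"
    "</body></html>"
)

# Same extraction expressions as in parse_item
title = re.findall(r"<!--TitleStart-->(.*?)<!--TitleEnd-->", html)[0]
publish_date = re.findall(r"釋出時間:(20\d{2}-\d{2}-\d{2})", html)[0]

print(title)         # 某公告標題
print(publish_date)  # 2019-05-20
```

Because the title sits between fixed HTML comment markers and the date has a fixed "20YY-MM-DD" shape, simple non-greedy regexes are enough here; for less regular markup, response.xpath or response.css selectors would be the more robust choice.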