Crawler Example 1: Scraping a News List and Publication Times
1. Create a new project
scrapy startproject shop
2. Code for items.py:
import scrapy


class ShopItem(scrapy.Item):
    title = scrapy.Field()  # headline text of each news entry
    time = scrapy.Field()   # publication time shown next to each headline
3. Spider code in shopspider.py
# -*- coding: UTF-8 -*-
import scrapy
from shop.items import ShopItem


class shopSpider(scrapy.Spider):
    name = "shop"
    allowed_domains = ["news.xxxxxxx.xx.cn"]
    start_urls = ["http://news.xxxxx.xxx.cn/hunan/"]

    def parse(self, response):
        item = ShopItem()
        # Each field holds a list with one entry per <li> in the news list.
        item['title'] = response.xpath("//div[@class='txttotwe2']/ul/li/a/text()").extract()
        item['time'] = response.xpath("//div[@class='txttotwe2']/ul/li/font/text()").extract()
        yield item
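To see what the two XPath expressions in the spider select, here is a minimal sketch that runs without Scrapy. It makes several assumptions: the HTML fragment is invented to match the shape the spider expects (a div with class txttotwe2 containing ul/li entries), and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is not a stand-in for Scrapy's selectors. The helper name extract_rows is hypothetical. Unlike the spider, it walks the list one li at a time, which keeps each title next to its own time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the list the spider targets.
SAMPLE = """
<html><body>
<div class="txttotwe2">
  <ul>
    <li><a href="/a1.html">First headline</a><font>(2017-06-16)</font></li>
    <li><a href="/a2.html">Second headline</a><font>(2017-06-15)</font></li>
  </ul>
</div>
</body></html>
"""


def extract_rows(markup):
    """Return one (title, time) tuple per <li> in the news list."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Same path as the spider, minus text(), which ElementTree does not support.
    for li in root.findall(".//div[@class='txttotwe2']/ul/li"):
        title = li.findtext("a", default="").strip()
        time = li.findtext("font", default="").strip()
        rows.append((title, time))
    return rows


rows = extract_rows(SAMPLE)
for title, time in rows:
    print(title, time)
```

In a real spider the same per-row idea would be written with Scrapy's own selectors by looping over the li nodes and yielding one item per entry.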
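4. Code for pipelines.py (prints the content):
Note: if you print the content from shopspider.py, it is displayed as unicode escape sequences, whereas the same content printed from pipelines.py is displayed as readable text. This is most likely because the spider holds whole lists, and under Python 2 printing a list shows the escaped form of each string, while the pipeline prints the strings one at a time. Also note that Scrapy only runs the pipeline if it is enabled under ITEM_PIPELINES in settings.py.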
class ShopPipeline(object):
    def process_item(self, item, spider):
        count = len(item['title'])
        print('news count: %d' % count)
        for i in range(count):
            # "biaoti" is the title, "shijian" is the publication time
            print('biaoti: ' + item['title'][i])
            print('shijian: ' + item['time'][i])
        return item
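The pipeline indexes item['time'] with positions taken from item['title'], so it raises an IndexError if the page yields fewer times than titles. A minimal sketch of a more defensive pairing step follows. It assumes a plain dict standing in for ShopItem, and the function names pair_titles_and_times and format_report are hypothetical, not part of the original project.

```python
def pair_titles_and_times(item):
    """Pair each title with a time; zip stops at the shorter of the two lists."""
    titles = item.get("title", [])
    times = item.get("time", [])
    return list(zip(titles, times))


def format_report(item):
    """Build the lines the pipeline would print, without indexing by position."""
    pairs = pair_titles_and_times(item)
    lines = ["news count: %d" % len(pairs)]
    for title, time in pairs:
        lines.append("biaoti: " + title)
        lines.append("shijian: " + time)
    return lines


# Hypothetical item: a plain dict standing in for ShopItem, with one time missing.
sample_item = {
    "title": ["First headline", "Second headline", "Third headline"],
    "time": ["(2017-06-16)", "(2017-06-15)"],
}

report = format_report(sample_item)
print("\n".join(report))
```

This only prevents the crash. zip cannot tell which entry is missing its time, so if an entry in the middle of the list has no time, the later pairs will be shifted by one.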
5. Crawl output:
[email protected]:~/shop# scrapy crawl shop --nolog
news count: 40
biaoti: xxx建成國家食品安全示範城市
shijian: (2017-06-16)
biaoti: xxxx考試開始報名
……………………
…………………..