
43. Scraping Lianjia second-hand housing listings with Scrapy - Part 1


First, the analysis:
Goal: collect second-hand housing (ershoufang) data from the Lianjia site.
1. Start with the main listing page:

url = https://gz.lianjia.com/ershoufang/pg1/
The page reports 27,589 listings in total, but it only serves 100 pages at 30 listings per page, i.e. at most 3,000 records.


2. Next, look at the filter conditions:

Under 1,000,000 CNY (775 listings): https://gz.lianjia.com/ershoufang/pg1p1/ (p1 is the filter parameter, pg1 is the page parameter), which returns 26 pages of results
1,000,000 to 1,200,000 CNY (471 listings): https://gz.lianjia.com/ershoufang/pg1p2/, which returns 16 pages of results

And so on for the other brackets. In other words, any single query returns at most 100 pages (3,000 records), and logging in does not change this.
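Since each price bracket stays well under the 100-page cap, a workaround is to enumerate the filter parameter and paginate within each bracket separately. A minimal sketch of the url generation, assuming the price brackets run p1 through p8 (the exact range should be verified against the site):

# Hypothetical helper: build filtered list-page urls, one price bracket at a time.
# Assumption: brackets p1..p8 exist; verify the real range on the site.
def build_filtered_urls(max_price_bracket=8, max_page=100):
    base = 'https://gz.lianjia.com/ershoufang/pg{page}p{price}/'
    for price in range(1, max_price_bracket + 1):
        for page in range(1, max_page + 1):
            yield base.format(page=page, price=price)

In a real spider, each bracket would also need a stop condition once its actual page count (26 pages, 16 pages, and so on) is exhausted.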

3. The scraping code is below.
This is the lianjia.py spider file. One thing to note: in settings.py you must set
ROBOTSTXT_OBEY = False, otherwise the pages return no data.
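For reference, the relevant settings.py lines (the DOWNLOAD_DELAY line is my own addition as a politeness measure, not something this run required):

# settings.py
ROBOTSTXT_OBEY = False   # required here: with robots.txt obeyed, the pages return no data
DOWNLOAD_DELAY = 1       # optional assumption: throttle requests to be polite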

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['gz.lianjia.com']
    start_urls = ['https://gz.lianjia.com/ershoufang/pg1/']

    def parse(self, response):
        # Collect the detail-page urls on the current list page
        link_urls = response.xpath("//div[@class='info clear']/div[@class='title']/a/@href").extract()
        for link_url in link_urls:
            # print(link_url)
            yield scrapy.Request(url=link_url, callback=self.parse_detail)
        print('*' * 100)
        # Pagination: pages 1-100
        for i in range(1, 101):
            url = 'https://gz.lianjia.com/ershoufang/pg{}/'.format(i)
            # print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_detail(self, response):
        title = response.xpath("//div[@class='title']/h1[@class='main']/text()").extract_first()
        print('Title: ' + title)
        dist = response.xpath("//div[@class='areaName']/span[@class='info']/a/text()").extract_first()
        print('District: ' + dist)
        contents = response.xpath("//div[@class='introContent']/div[@class='base']")
        # print(contents)
        house_type = contents.xpath("./div[@class='content']/ul/li[1]/text()").extract_first()
        print('Layout: ' + house_type)
        floor = contents.xpath("./div[@class='content']/ul/li[2]/text()").extract_first()
        print('Floor: ' + floor)
        built_area = contents.xpath("./div[@class='content']/ul/li[3]/text()").extract_first()
        print('Built area: ' + built_area)
        family_structure = contents.xpath("./div[@class='content']/ul/li[4]/text()").extract_first()
        print('Layout structure: ' + family_structure)
        inner_area = contents.xpath("./div[@class='content']/ul/li[5]/text()").extract_first()
        print('Inner area: ' + inner_area)
        architectural_type = contents.xpath("./div[@class='content']/ul/li[6]/text()").extract_first()
        print('Building type: ' + architectural_type)
        house_orientation = contents.xpath("./div[@class='content']/ul/li[7]/text()").extract_first()
        print('Orientation: ' + house_orientation)
        building_structure = contents.xpath("./div[@class='content']/ul/li[8]/text()").extract_first()
        print('Building structure: ' + building_structure)
        decoration_condition = contents.xpath("./div[@class='content']/ul/li[9]/text()").extract_first()
        print('Decoration: ' + decoration_condition)
        proportion = contents.xpath("./div[@class='content']/ul/li[10]/text()").extract_first()
        print('Elevator-to-unit ratio: ' + proportion)
        elevator = contents.xpath("./div[@class='content']/ul/li[11]/text()").extract_first()
        print('Elevator: ' + elevator)
        age_limit = contents.xpath("./div[@class='content']/ul/li[12]/text()").extract_first()
        print('Property term: ' + age_limit)
        # extract_first() returns None rather than raising, so fall back to '' directly
        house_label = response.xpath("//div[@class='content']/a/text()").extract_first() or ''
        print('Listing tags: ' + house_label)
        # decoration_description = response.xpath("//div[@class='baseattribute clear'][1]/div[@class='content']/text()").extract_first()
        # print('Decoration description: ' + decoration_description)
        # community_introduction = response.xpath("//div[@class='baseattribute clear'][2]/div[@class='content']/text()").extract_first()
        # print('Community introduction: ' + community_introduction)
        # huxing_introduce = response.xpath("//div[@class='baseattribute clear'][3]/div[@class='content']/text()").extract_first()
        # print('Layout introduction: ' + huxing_introduce)
        # selling_point = response.xpath("//div[@class='baseattribute clear'][4]/div[@class='content']/text()").extract_first()
        # print('Key selling points: ' + selling_point)
        # Open the file in append mode; the pointer sits at the end, so each record is appended
        with open('text', 'a', encoding='utf-8') as f:
            f.write('\n'.join(
                [title, dist, house_type, floor, built_area, family_structure, inner_area,
                 architectural_type, house_orientation, building_structure, decoration_condition,
                 proportion, elevator, age_limit, house_label]))
            f.write('\n' + '=' * 50 + '\n')
        print('-' * 100)
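With the spider saved, it is run from the project root in the usual Scrapy way:

scrapy crawl lianjia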
4. The run above scrapes everything with no filter conditions, so it only gets the 100 pages the site serves.
The collected data looks like this:
Only 15 fields were collected here; the remaining fields were skipped.
Scraping all 100 pages yielded 2,704 records.
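Writing to a file inside parse_detail works, but the more idiomatic Scrapy pattern is to yield an Item and let a pipeline handle persistence. A minimal sketch, with one Field per attribute scraped above (the class name HouseItem is my own):

# items.py - hypothetical Item mirroring the variables in parse_detail
import scrapy

class HouseItem(scrapy.Item):
    title = scrapy.Field()
    dist = scrapy.Field()
    house_type = scrapy.Field()
    floor = scrapy.Field()
    built_area = scrapy.Field()
    family_structure = scrapy.Field()
    inner_area = scrapy.Field()
    architectural_type = scrapy.Field()
    house_orientation = scrapy.Field()
    building_structure = scrapy.Field()
    decoration_condition = scrapy.Field()
    proportion = scrapy.Field()
    elevator = scrapy.Field()
    age_limit = scrapy.Field()
    house_label = scrapy.Field()

parse_detail would then fill and yield a HouseItem instead of opening the file directly, and a pipeline registered in settings.py would take over writing the records out.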
5. This was written last week and has not been polished; next I will work through the filter-condition urls to collect as much of the site's data as possible.
