
Python Crawler [Hands-On]: Crawling a Recruitment Site with the Scrapy Framework and Storing the Results in MongoDB

Create the project

scrapy startproject zhaoping

Create the spider

cd zhaoping
scrapy genspider hr zhaopingwang.com

Directory structure
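The two commands above scaffold the standard Scrapy layout, roughly:

zhaoping/
    scrapy.cfg
    zhaoping/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            hr.py          # generated by genspider

The three files edited below are items.py, pipelines.py and spiders/hr.py.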

items.py

import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()
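Scrapy items support dict-style access, which is why the pipeline below can convert them with dict(item) before inserting into MongoDB. A quick illustration, assuming the TencentItem class above:

item = TencentItem()
item["title"] = "some job title"
print(dict(item))  # {'title': 'some job title'}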

pipelines.py

from pymongo import MongoClient

mongoclient = MongoClient(host='192.168.226.150', port=27017)
collection = mongoclient['zhaoping']['hr']

class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # the item has to be converted to a plain dict before insertion
        # (insert_one is the current pymongo call; the old insert() is deprecated)
        collection.insert_one(dict(item))
        return item
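For items to actually flow through this pipeline, it also has to be enabled in settings.py (standard Scrapy configuration; 300 is just an arbitrary priority):

ITEM_PIPELINES = {
    "zhaoping.pipelines.TencentPipeline": 300,
}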

spiders/hr.py

    def parse(self, response):
        # skip the first and the last tr (header and footer rows of the table)
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = TencentItem()
            # xpath positions are counted from 1
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # build the url of the next page
        if next_url != "javascript:;":
            print(next_url)
            next_url = "https://hr.tencent.com/" + next_url
            yield scrapy.Request(url=next_url, callback=self.parse)

And that is all it takes: the data is scraped and stored.
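To confirm the documents actually landed, a quick pymongo check against the same host and collection used by the pipeline (a sketch; the field names match the item definition):

from pymongo import MongoClient

client = MongoClient(host='192.168.226.150', port=27017)
collection = client['zhaoping']['hr']
print(collection.count_documents({}))   # number of postings stored
for doc in collection.find().limit(3):  # peek at a few documents
    print(doc)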