
Scraping ratings for books of every genre with Scrapy and MongoDB

Overall approach:

Use the category links on the tag page to collect the links to every book.

Step 1: Adjust the settings file

ROBOTSTXT_OBEY = False   # ignore robots.txt
DOWNLOAD_DELAY = 1       # download delay; best to keep this enabled
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
}  # add request headers to disguise the crawler as a browser
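
A fixed delay works, but as an aside, Scrapy's AutoThrottle extension can adapt the delay to the server's response times instead; enabling it is one extra settings line:

AUTOTHROTTLE_ENABLED = True  # let Scrapy adjust the download delay dynamically based on server load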
Next, create a new start.py file to launch the Scrapy crawler; this makes later debugging easier:
from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())
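
Running python start.py from the project root is then equivalent to typing scrapy crawl douban on the command line, and lets you attach an IDE debugger to the spider.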

Step 2: Define the desired fields in the item

import scrapy

class DushuItem(scrapy.Item):
    book_title = scrapy.Field()   # book title
    book_link = scrapy.Field()    # link to the book's page
    range_nums = scrapy.Field()   # rating
    pl = scrapy.Field()           # review count

Step 3: The real work begins: crawl the links contained in the tag page

start_urls = ['https://book.douban.com/tag/?view=type&icn=index-sorttags-all']

The start URL https://book.douban.com/tag/?view=type&icn=index-sorttags-all contains the links for every tag. Parse it and join the relative paths into the full URLs we want.
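
The methods below live inside the spider class. The post only shows the methods themselves, so here is a minimal sketch of the class header; the class name and allowed_domains are assumptions, while name='douban' matches the crawl command in start.py and url is the site root that self.url is later concatenated with:

import scrapy
from dushu.items import DushuItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'                        # matches "scrapy crawl douban" in start.py
    allowed_domains = ['book.douban.com']  # assumed: restrict the crawl to the book site
    url = 'https://book.douban.com'        # assumed site root, used to join relative hrefs
    start_urls = ['https://book.douban.com/tag/?view=type&icn=index-sorttags-all']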

def parse(self, response):
    divs = response.xpath('//div[@class="article"]/div[2]')
    for div in divs:
        names = div.xpath('.//a/h2/text()')
        # print(names)
        trs = div.xpath('.//table[@class="tagCol"]/tbody')
        for tr in trs:
            tds = tr.xpath('.//tr')
            for td in tds:
                td_links = td.xpath('.//a/@href').extract()
                for link in td_links:
                    detail_url = self.url + link
                    yield scrapy.Request(url=detail_url, callback=self.parse_tag)  # visit the tag's listing page
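
The hrefs on the tag page are relative paths, which is why they are joined onto self.url. As a side note, Scrapy's response.urljoin performs the same join without a hand-maintained base, so the inner loop could equally be written as:

for link in td_links:
    yield scrapy.Request(url=response.urljoin(link), callback=self.parse_tag)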

Step 4: Parse each tag's listing page and extract the desired fields, returning them to the pipeline through the item.

def parse_tag(self, response):
    lis = response.xpath('//div[@id="subject_list"]/ul')
    for li in lis:
        book_title = li.xpath('.//div[@class="info"]/h2/a/@title').getall()
        book_link = li.xpath('.//div[@class="info"]/h2/a/@href').getall()
        range_nums = li.xpath('.//div[@class="star clearfix"]/span[2]/text()').getall()
        pl = li.xpath('.//div[@class="star clearfix"]/span[3]/text()').getall()
        pls = [p.strip() for p in pl]  # strip the whitespace around each review count
        for a, b, c, d in zip(book_title, book_link, range_nums, pls):
            item = DushuItem(book_link=b, book_title=a, range_nums=c, pl=d)
            yield item
Step 5: Still inside parse_tag, grab the link to the next page, check that one exists, and request it:

next_url = response.xpath('//span[@class="next"]/a/@href').extract()
if next_url:
    next_link = self.url + next_url[0]
    yield scrapy.Request(url=next_link, callback=self.parse_tag)
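
For what it's worth, response.follow resolves relative URLs itself, so the pagination step can be shortened to the following sketch (same XPath as above):

next_url = response.xpath('//span[@class="next"]/a/@href').get()
if next_url:
    yield response.follow(next_url, callback=self.parse_tag)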

Step 6: Save the items to MongoDB

Start the MongoDB service:

In cmd, enter: mongod --dbpath="D:\MongoDB\db"
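
Before starting the crawl, it is worth confirming the server is actually reachable; a quick sketch using pymongo against the default port:

from pymongo import MongoClient

client = MongoClient('mongodb://127.0.0.1:27017', serverSelectionTimeoutMS=2000)
print(client.server_info()['version'])  # raises ServerSelectionTimeoutError if the server is down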

In pipelines.py, write:

from pymongo import MongoClient
from scrapy import Item

class MongoDBPipeline(object):

    # open the database connection
    def open_spider(self, spider):
        db_uri = spider.settings.get('MONGODB_URI', 'mongodb://localhost:27017')
        db_name = spider.settings.get('MONGODB_DB_NAME', 'scrapy_db')

        self.db_client = MongoClient(db_uri)
        self.db = self.db_client[db_name]

    # close the database connection
    def close_spider(self, spider):
        self.db_client.close()

    # process each item
    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    # insert the data
    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        self.db.books.insert_one(item)  # store the item as a document in the books collection

Add these to settings.py:
MONGODB_URI = 'mongodb://127.0.0.1:27017'
MONGODB_DB_NAME = 'scrapy_db'
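
Once a crawl has run, the stored documents can be inspected from any Python shell; a minimal sketch, assuming the database and collection names used above:

from pymongo import MongoClient

db = MongoClient('mongodb://127.0.0.1:27017')['scrapy_db']
print(db.books.count_documents({}))  # how many books were saved
for doc in db.books.find().limit(5):
    print(doc['book_title'], doc['range_nums'], doc['pl'])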

Step 7: Enable the pipeline

ITEM_PIPELINES = {
    # 'dushu.pipelines.DushuPipeline': 300,
    'dushu.pipelines.MongoDBPipeline': 300,
}

Finally, launch the crawler by running start.py.