Python Crawler: Scraping the Douban Movie Top 250 and Loading It into MongoDB

阿新 • Published: 2019-01-14

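I have recently been learning about web crawlers. I am new to the topic and still a beginner, so if you spot any mistakes, please point them out.

from multiprocessing import Pool
from urllib.request import Request, urlopen
import re

import pymongo

# Running count of the movies parsed so far
index = 0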

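class DouBanSpider(object):
    # Kept at class level rather than on the instance: a process pool cannot
    # serialize pymongo objects, because they hold thread locks.
    client = pymongo.MongoClient('localhost')
    db = client['dbmovie']

    def __init__(self):
        self.headers = {
            'User-Agent': 'put your own browser User-Agent here',
            # Douban requires login before this data can be crawled,
            # so add your own Cookie.
            'Cookie': 'put your own Cookie here',
        }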

    def get_list_html(self, page_num):
        # Each list page holds 25 movies; turn the 1-based page number
        # into the offset that the `start` parameter expects.
        start = (page_num - 1) * 25
        list_url = 'https://movie.douban.com/top250?start={}'.format(start)
        request = Request(list_url, headers=self.headers)
        try:
            response = urlopen(request)
        except Exception as e:
            print('Request failed: url {}, reason {}'.format(list_url, e))
            return None
        else:
            html = response.read().decode()
            return html

    def parse_list_html(self, html):
        if html:
            # Capture the detail-page link of every movie on the list page.
            pattern = re.compile(r'<div class="hd">.*?<a href="(.*?)" class.*?>.*?', re.S)
            detail_urls = pattern.findall(html)
            return detail_urls
        else:
            print('HTML source is None')
            return None

    def get_detail_html(self, detail_url):
        request = Request(detail_url, headers=self.headers)
        try:
            response = urlopen(request)
        except Exception as e:
            print('Request failed: url {}, reason {}'.format(detail_url, e))
            return None
        else:
            detail_html = response.read().decode()
            return detail_html

    def parse_detail_html(self, detail_html):
        global index
        # NOTE: this pattern appears to have lost some of its HTML tags when the
        # post was published; check it against the live page source before use.
        pattern = re.compile(
            r'<h1>.*?(.*?).*?<div id="info">.*?<a href=.*?>(.*?)</a>.*?<a href=.*?>(.*?)</a>.*?.*?<a href=.*?>(.*?)</a> .*?.*?(.*?) .*?.*? (.*?) .*?.*?(.*?) .*? (.*?) .*?.*? (.*?).*? .*?.*?(.*?) .*?', re.S)
        matches = pattern.findall(detail_html)
        if not matches:
            # Avoid an IndexError when a page does not fit the pattern.
            print('Detail page did not match the pattern, skipping')
            return
        data = matches[0]
        index = index + 1
        print(index, data)
        # Keys follow the labels on the page: title, director, writer, cast,
        # genre, country/region, language, release date, runtime, also known as.
        fields = ['影片名', '導演', '編劇', '主演', '型別',
                  '製片國家/地區', '語言', '上映日期', '片長', '又名']
        dic = dict(zip(fields, data))
        for name, value in dic.items():
            print('{}：'.format(name), value)
        self.db['movie'].insert_one(dic)

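    def start_spider(self, num):
        print('Requesting page {}'.format(num))
        list_html = self.get_list_html(num)
        if not list_html:
            return
        detail_urls = self.parse_list_html(list_html)
        if not detail_urls:
            return
        for i, detail_url in enumerate(detail_urls, start=1):
            # Overall rank across pages (25 movies per page), so that the
            # check below still works although the loop restarts on each page.
            rank = (num - 1) * 25 + i
            # Entry 164 is filtered out by hand: the film 二十二 has no
            # "又名" (also known as) field, so the pattern cannot parse it.
            if rank == 164:
                continue
            detail_html = self.get_detail_html(detail_url)
            if detail_html:
                self.parse_detail_html(detail_html)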

if __name__ == '__main__':
    obj = DouBanSpider()
    pool = Pool(1)
    # Pages 1 to 10 cover all 250 entries (25 per page).
    pool.map(obj.start_spider, range(1, 11))
    pool.close()
    pool.join()
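
The comment on the class-level `MongoClient` says that a process pool cannot serialize objects holding thread locks. The sketch below illustrates that idea without touching pymongo: it uses a plain `threading.Lock` as a stand-in, and the class and function names (`HoldsLock`, `SharesLock`, `can_pickle`) are made up for the illustration. `Pool.map` has to pickle the bound method it is given, which drags the instance along with it.

```python
import pickle
import threading


class HoldsLock:
    """Stand-in for an object that keeps a thread lock on the instance."""

    def __init__(self):
        self.lock = threading.Lock()

    def work(self, n):
        return n * 2


class SharesLock:
    """Keeps the lock on the class, so the instance has no unpicklable state."""

    lock = threading.Lock()

    def work(self, n):
        return n * 2


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, which Pool.map relies on."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


if __name__ == '__main__':
    # Pickling a bound method also pickles the instance it belongs to.
    print(can_pickle(HoldsLock().work))
    print(can_pickle(SharesLock().work))
```

Class attributes are not part of an instance's pickled state, which is presumably why the spider declares `client` and `db` on the class instead of in `__init__`.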


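Writing a crawler takes patience. When no data comes out, the cause is very often a mistake in the regular expression, so check your patterns carefully.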
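
One way to make pattern mistakes easier to find is to test each field separately against a small saved snippet before running the whole crawler. The sketch below is only an illustration: the sample markup is a simplified, made-up stand-in and not Douban's actual page source, and `extract_fields` is a hypothetical helper. Because every field has its own pattern, a missing field such as "又名" becomes `None` instead of breaking the whole match.

```python
import re

# Hypothetical, simplified markup: one labelled line per field.
SAMPLE_WITH_ALIAS = (
    '<div id="info">'
    '<span class="pl">語言:</span> 英語<br/>'
    '<span class="pl">又名:</span> 月黑高飛<br/>'
    '</div>'
)
SAMPLE_WITHOUT_ALIAS = (
    '<div id="info">'
    '<span class="pl">語言:</span> 漢語普通話<br/>'
    '</div>'
)

FIELDS = ['語言', '又名']


def extract_fields(html, fields=FIELDS):
    """Look up each field with its own pattern; a missing field becomes None."""
    result = {}
    for name in fields:
        pattern = re.compile(
            r'<span class="pl">' + re.escape(name) + r':</span>\s*(.*?)\s*<br',
            re.S)
        match = pattern.search(html)
        result[name] = match.group(1) if match else None
    return result


if __name__ == '__main__':
    print(extract_fields(SAMPLE_WITH_ALIAS))
    print(extract_fields(SAMPLE_WITHOUT_ALIAS))
```

With per-field patterns, a failing field can be checked in isolation, and an entry without an alternative title no longer needs to be skipped by hand.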
