Python crawler: scraping Maoyan movies (score handling and multithreading)
A Xin • Published: 2019-01-27
The scraping uses the requests and beautifulsoup libraries. The code is not hard to write; what needs care is a handful of details.
1. Handling the movie score
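Right-click and inspect the element, and you will see that the integer part and the fractional part of the score sit in separate elements. In beautifulsoup you can read them with .strings or .stripped_strings, but what comes back is an iterable generator, so you only see the contents after materializing it, for example by converting it to a list.
After searching around online I finally found the fix (the real problem was a gap in my basics): join the pieces into a single string with ''.join(score.strings), as done in the code at the end of this post.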
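To make the generator behaviour concrete, here is a minimal sketch. The HTML snippet and its class names (score, integer, fraction) are assumptions made up for illustration, not copied from the live page, and the built-in html.parser is used so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a score whose integer and fractional parts
# live in two separate child elements.
html = '<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>'
soup = BeautifulSoup(html, "html.parser")
score_tag = soup.select_one("p.score")

# .strings is a generator; printing it directly shows only the generator object.
print(score_tag.strings)

# Materialize it to see the individual pieces.
parts = list(score_tag.strings)

# Join the pieces to get the score as one string.
score = "".join(score_tag.strings)
print(parts, score)
```

.stripped_strings works the same way but strips surrounding whitespace from each piece and skips pieces that are whitespace only, which can help when the markup is indented.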
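2. Multithreading
Running the pages in parallel uses the multiprocessing library. Note that its Pool starts worker processes rather than threads; see the code for the exact usage.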
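The Pool pattern can be sketched on its own, without any network access. In this sketch build_url is a hypothetical helper that stands in for the real per-page work, and the pool size of 4 is an arbitrary choice.

```python
from multiprocessing import Pool


def build_url(offset):
    # Stand-in for the per-page work: only builds the page URL.
    return "http://maoyan.com/board/4?offset=" + str(offset)


# The board is paged in steps of 10: offsets 0, 10, ..., 90.
offsets = [i * 10 for i in range(10)]

if __name__ == "__main__":
    # Each offset is handed to a worker process; map() blocks until
    # all results are back and returns them in input order.
    with Pool(processes=4) as pool:
        urls = pool.map(build_url, offsets)
    print(urls[0], urls[-1])
```

The `if __name__ == "__main__":` guard matters because worker processes may re-import the module on some platforms. For I/O-bound scraping, a thread-backed pool such as multiprocessing.pool.ThreadPool exposes the same map() interface and may be enough.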
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import pymongo
from multiprocessing import Pool

# MongoDB: database "movie", collection "maoyan"
client = pymongo.MongoClient('localhost', 27017)
movie = client['movie']
maoyan = movie['maoyan']

# for item in maoyan.find():
#     print(item)


def get_one_page(url):
    # Return the page HTML, or None if the request fails
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # print(soup)
    ranks = soup.select('#app > div > div > div.main > dl > dd > i')  # text
    titles = soup.select('div.movie-item-info p.name a')  # text
    stars = soup.select('div.movie-item-info p.star')  # text
    times = soup.select('div.movie-item-info p.releasetime')
    scores = soup.select('#app > div > div > div.main > dl > dd > div > div > div.movie-item-number.score-num > p')
    for rank, title, star, release_time, score in zip(ranks, titles, stars, times, scores):
        data = {
            'rank': rank.text.strip(),
            'title': title.text.strip(),
            'star': star.text.strip(),
            'time': release_time.text.strip(),
            # The integer and fractional parts are separate strings; join them
            'score': ''.join(score.strings),
        }
        # insert() is deprecated in PyMongo 3 and removed in PyMongo 4
        maoyan.insert_one(data)
        print('yes')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # print(html)
    if html is None:
        # The request failed; skip this page instead of parsing None
        return
    parse_one_page(html)


if __name__ == '__main__':
    # One task per page: offsets 0, 10, ..., 90
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])