Python crawler: scraping Maoyan movies (score handling and multiprocessing)

The scraper is built on the requests and BeautifulSoup libraries. The code itself is not hard to write; what needs care is a handful of details.

1. Handling the movie score


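Right-click the score and inspect the element: the integer part and the fractional part of the score sit in separate tags. In BeautifulSoup, .strings or .stripped_strings collects both pieces, but each returns a lazy generator, so printing it directly shows only a generator object. The pieces become visible once the generator is consumed, for example with list() or str.join().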

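After searching around online I found the fix; the real problem was a gap in my own basics. Joining the pieces into one string with ''.join(score.strings) gives the complete score, and this is how parse_one_page fills the 'score' field in the code at the end of this post.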
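The join can be tried in isolation. This is a minimal sketch, not the page's exact markup: the SAMPLE fragment, its class names (score, integer, fraction) and the helper extract_score are assumptions made for illustration, and 'html.parser' is used here only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Assumed markup: the score is split across two child tags.
SAMPLE = '<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>'


def extract_score(html):
    """Return the full score text, or None if no score tag is found."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select_one('p.score')
    if tag is None:
        return None
    # .stripped_strings is a generator; join() consumes it into one string.
    return ''.join(tag.stripped_strings)


if __name__ == '__main__':
    tag = BeautifulSoup(SAMPLE, 'html.parser').select_one('p.score')
    print(tag.strings)        # a generator object, not the text itself
    print(list(tag.strings))  # the individual text pieces
    print(extract_score(SAMPLE))
```

Using .stripped_strings instead of .strings also drops whitespace that may sit around the tags in the real page.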


2. Multiprocessing

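Fetching several pages in parallel uses the multiprocessing library. Note that multiprocessing.Pool starts worker processes rather than threads; each worker handles one page offset. See the code below for the details.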

import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import pymongo
from multiprocessing import Pool

# Connect to a local MongoDB instance; results are stored in movie.maoyan.
client = pymongo.MongoClient('localhost', 27017)
movie = client['movie']
maoyan = movie['maoyan']
# To inspect what has been stored:
# for item in maoyan.find():
#     print(item)
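

def get_one_page(url):
    """Return the page HTML, or None if the request fails or is not 200."""
    try:
        # A timeout (10 s is an arbitrary choice) keeps a stalled connection
        # from hanging the worker.
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None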


def parse_one_page(html):
    """Extract rank, title, cast, release time and score, and store each film."""
    soup = BeautifulSoup(html, 'lxml')
    ranks = soup.select('#app > div > div > div.main > dl > dd > i')
    titles = soup.select('div.movie-item-info p.name a')
    stars = soup.select('div.movie-item-info p.star')
    times = soup.select('div.movie-item-info p.releasetime')
    scores = soup.select('#app > div > div > div.main > dl > dd > div > div > div.movie-item-number.score-num > p')
    for rank, title, star, release_time, score in zip(ranks, titles, stars, times, scores):
        data = {
            'rank': rank.text.strip(),
            'title': title.text.strip(),
            'star': star.text.strip(),
            'time': release_time.text.strip(),
            # The score is split across child tags; join the text pieces.
            'score': ''.join(score.strings),
        }
        # insert_one replaces Collection.insert, which PyMongo 4 removed.
        maoyan.insert_one(data)
        print('yes')


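def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # get_one_page returns None on failure; skip the page instead of crashing.
    if html is None:
        return
    parse_one_page(html)


if __name__ == '__main__':
    # Pool starts worker processes (not threads); one task per page offset.
    with Pool() as pool:
        pool.map(main, [i * 10 for i in range(10)])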
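
Pool in the script above starts separate processes. Since page fetching is mostly waiting on the network, real threads are a reasonable alternative, and multiprocessing.pool.ThreadPool exposes the same map interface. A minimal sketch under stated assumptions: build_url is a hypothetical stand-in for the per-page work that main does, and the worker count of 4 is an arbitrary choice.

```python
from multiprocessing.pool import ThreadPool

BASE_URL = 'http://maoyan.com/board/4?offset='


def build_url(offset):
    # Hypothetical stand-in for the real per-page work (fetch and parse).
    return BASE_URL + str(offset)


def run_in_threads(offsets, workers=4):
    # ThreadPool.map takes the same arguments as Pool.map and returns
    # results in the same order as the inputs.
    with ThreadPool(workers) as pool:
        return pool.map(build_url, offsets)


if __name__ == '__main__':
    offsets = [i * 10 for i in range(10)]
    for url in run_in_threads(offsets):
        print(url)
```

Swapping build_url for the script's main would run the same ten offsets on threads. Threads share one process, so the module-level MongoClient is shared as well rather than copied into each worker.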