Python: scraping HTML and parsing it with re regular expressions (Part 1)

Scraping HTML and parsing it with re

#coding=utf-8

import urllib.request
import re

'''
url: "http://money.163.com/special/pinglun/"
Scrape the news items on the first page and output them in the following format.

[
  {'title':'生鮮電商為何難盈利?','created_at':'2013-05-03 08:43','url':'http://money.163.com/13/0503/08/8TUHSEEI00254ITK.html'},
  {'title':'生鮮電商為何難盈利?','created_at':'2013-05-03 08:43','url':'http://money.163.com/13/0503/08/8TUHSEEI00254ITK.html'}
]
'''
url = 'http://money.163.com/special/pinglun/'
result = []
f = urllib.request.urlopen(url)

# <meta http-equiv="Content-Type" content="text/html; charset=gbk">
# The page is encoded as gbk, so it must be decoded as gbk as well.
content = f.read().decode('gbk')
# content = str(f.read(),'utf-8','ignore')

# Compile a pattern that matches each news item block in the page.
pattern = re.compile(r'<div class="list_item clearfix">.*?</span>',re.S)
# Filter the HTML, keeping only the parts that match the pattern above.
basic_content = re.finditer(pattern,content)

# Process each block to extract the three fields we want: title, created_at, url.
for i in basic_content:
    init_dict = {}
    d = re.match(r'<div class="list_item clearfix">.*?<h2><a href="(.*?)">(.*?)</a></h2>.*?<span class="time">(.*?)</span>',i.group(),re.S)
    init_dict['title'] = d.group(2)
    init_dict['created_at'] = d.group(3)
    init_dict['url'] = d.group(1)
    result.append(init_dict)
print (result)

Output:

[
{'title': '賈躍亭的成功意味著實體失敗?', 'created_at': '2016-04-25 14:28:18', 'url': 'http://money.163.com/16/0425/14/BLGM1PH5002551G6.html'}, 
{'title': '海爾模式為何在西方叫好不叫座', 'created_at': '2016-04-22 15:00:23', 'url': 'http://money.163.com/16/0422/15/BL90MCB400253G87.html'}, 
{'title': '有前科就不能開網約車?', 'created_at': '2016-04-12 15:30:49', 'url': 'http://money.163.com/16/0412/15/BKFAETGB002552IJ.html'}, 
{'title': '影業公司能助網路視訊擡身價嗎', 'created_at': '2016-03-31 13:43:27', 'url': 'http://money.163.com/16/0331/13/BJG7HME600253G87.html'}, 
{'title': '美的收購東芝究竟值不值?', 'created_at': '2016-03-31 08:48:45', 'url': 'http://money.163.com/16/0331/08/BJFMM2AB00253G87.html'}, 
{'title': '日本家電企業真的不行了嗎?', 'created_at': '2016-03-18 16:40:02', 'url': 'http://money.163.com/16/0318/16/BIF2FM7A002551G6.html'}, 
{'title': '淘寶只是中國製造亂象的鏡子', 'created_at': '2016-03-16 09:56:58', 'url': 'http://money.163.com/16/0316/09/BI96K6L000253G87.html'}, 
{'title': 'iPhone 6s太失敗? 蘋果需創新', 'created_at': '2016-01-26 14:45:14', 'url': 'http://money.163.com/16/0126/14/BE8V83A500253G87.html'}, 
{'title': '從貼吧事件看大公司如何擔責', 'created_at': '2016-01-18 16:02:05', 'url': 'http://money.163.com/16/0118/16/BDKGF2C000253G87.html'},
{'title': '銷量不佳股價跌 蘋果錯在哪裡', 'created_at': '2016-01-11 14:49:43', 'url': 'http://money.163.com/16/0111/14/BD2BHH85002551G6.html'},
{'title': '視訊網站為何對快播痛下殺手?', 'created_at': '2016-01-11 14:30:31', 'url': 'http://money.163.com/16/0111/14/BD2AEC0E002551G6.html'},
{'title': '黎萬強重振小米是個偽命題?', 'created_at': '2016-01-05 13:51:55', 'url': 'http://money.163.com/16/0105/13/BCIPRCDP002551G6.html'},
{'title': '手機廠商頻死亡 將大洗牌?', 'created_at': '2015-12-31 12:14:33', 'url': 'http://money.163.com/15/1231/12/BC5O9GEI002551G6.html'}, 
{'title': '2015三星與蘋果暗戰勝負幾何?', 'created_at': '2015-12-29 14:55:41', 'url': 'http://money.163.com/15/1229/14/BC0SN3OC002551G6.html'},
{'title': '寶能作為門口野蠻人是壞人嗎', 'created_at': '2015-12-19 12:31:57', 'url': 'http://money.163.com/15/1219/12/BB6SGNBI002551G6.html'}
]

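If the content is decoded as utf-8 instead, the output is garbled. Moreover, if 'ignore' is not passed, decoding raises an error. The listing below shows the garbled output with 'ignore', followed by the traceback produced without it.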

content = f.read().decode('utf-8','ignore')
[
{'title': 'Ծͤijɹζʵʧ?', 'created_at': '2016-04-25 14:28:18', 'url': 'http://money.163.com/16/0425/14/BLGM1PH5002551G6.html'}, {'title': 'ģʽΪкò', 'created_at': '2016-04-22 15:00:23', 'url': 'http://money.163.com/16/0422/15/BL90MCB400253G87.html'},
{'title': 'ǰƾͲܿԼ', 'created_at': '2016-04-12 15:30:49', 'url': 'http://money.163.com/16/0412/15/BKFAETGB002552IJ.html'}, {'title': 'Ӱҵ˾Ƶ̧', 'created_at': '2016-03-31 13:43:27', 'url': 'http://money.163.com/16/0331/13/BJG7HME600253G87.html'}, 
{'title': 'չֵֵ֥', 'created_at': '2016-03-31 08:48:45', 'url': 'http://money.163.com/16/0331/08/BJFMM2AB00253G87.html'}, {'title': 'ձҵҵIJ', 'created_at': '2016-03-18 16:40:02', 'url': 'http://money.163.com/16/0318/16/BIF2FM7A002551G6.html'}, 
{'title': 'Աֻйľ', 'created_at': '2016-03-16 09:56:58', 'url': 'http://money.163.com/16/0316/09/BI96K6L000253G87.html'}, 
{'title': 'iPhone 6s̫ʧ? ƻ貼', 'created_at': '2016-01-26 14:45:14', 'url': 'http://money.163.com/16/0126/14/BE8V83A500253G87.html'}, 
{'title': '¼˾ε', 'created_at': '2016-01-18 16:02:05', 'url': 'http://money.163.com/16/0118/16/BDKGF2C000253G87.html'}, 
{'title': 'ѹɼ۵ ƻ', 'created_at': '2016-01-11 14:49:43', 'url': 'http://money.163.com/16/0111/14/BD2BHH85002551G6.html'}, 
{'title': 'ƵվΪζԿ첥ʹɱ?', 'created_at': '2016-01-11 14:30:31', 'url': 'http://money.163.com/16/0111/14/BD2AEC0E002551G6.html'}, 
{'title': 'ǿСǸα⣿', 'created_at': '2016-01-05 13:51:55', 'url': 'http://money.163.com/16/0105/13/BCIPRCDP002551G6.html'}, 
{'title': 'ֻƵ ϴƣ', 'created_at': '2015-12-31 12:14:33', 'url': 'http://money.163.com/15/1231/12/BC5O9GEI002551G6.html'}, 
{'title': '2015ƻսʤΣ', 'created_at': '2015-12-29 14:55:41', 'url': 'http://money.163.com/15/1229/14/BC0SN3OC002551G6.html'}, 
{'title': 'ΪſҰǻ', 'created_at': '2015-12-19 12:31:57', 'url': 'http://money.163.com/15/1219/12/BB6SGNBI002551G6.html'}
]
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    content = f.read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 167: invalid continuation byte
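
Both outcomes above (garbled text with 'ignore', an exception without it) can be reproduced offline without fetching the page. The following is only a minimal sketch: the sample title is borrowed from the format example earlier, and the variable names (raw, good, lossy, strict_failed) are made up for illustration.

```python
# Simulate the bytes a gbk-encoded page would send for one title.
raw = '生鮮電商為何難盈利?'.encode('gbk')

# Decoding with the page's real encoding recovers the text.
good = raw.decode('gbk')
print(good)

# Strict utf-8 decoding fails, because gbk byte pairs are not valid utf-8.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict utf-8 decoding failed:', exc)

# With 'ignore', the invalid bytes are silently dropped, so no error is
# raised, but the resulting text no longer matches the original title.
lossy = raw.decode('utf-8', 'ignore')
print(repr(lossy))
```

In other words, 'ignore' only hides the problem; the fix is to decode with the encoding the page actually declares.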

The original program contains some unnecessary processing logic. It can be simplified as follows:

#coding=utf-8

import urllib.request
import re

url = 'http://money.163.com/special/pinglun/'

result = []

f = urllib.request.urlopen(url)

#<meta http-equiv="Content-Type" content="text/html; charset=gbk">
# The page is encoded as gbk, so it must be decoded as gbk as well.
content = f.read().decode('gbk')
# content = str(f.read(),'utf-8','ignore')


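# Compile a pattern that captures url, title and time for each news item in one pass.
pattern = re.compile(r'<div class="list_item clearfix">.*?<h2><a href="(.*?)">(.*?)</a></h2>.*?<span class="time">(.*?)</span>',re.S)
# Scan the HTML for every match of the pattern above.
basic_content = re.finditer(pattern,content)

# Debug print: this shows the iterator object itself, not the matched items.
print (basic_content)
# Build the result from the captured groups: title, created_at, url.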
for i in basic_content:
	result.append({'title':i.group(2),'created_at':i.group(3),'url':i.group(1)})
print (result)
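
To try the simplified pattern without hitting the network, it can be run against a small inline sample. This is a sketch under assumptions: SAMPLE_HTML is a hand-written snippet that mimics the structure the pattern expects (the real page markup may differ), the example.com URLs are placeholders, and parse_items and ITEM_PATTERN are hypothetical names. Named groups are used here in place of the numbered groups so the dict keys come straight from the pattern.

```python
import re

# Hand-written sample mimicking the markup the pattern expects;
# this is an assumption, not a copy of the real page.
SAMPLE_HTML = '''
<div class="list_item clearfix">
  <h2><a href="http://example.com/a.html">First title</a></h2>
  <span class="time">2016-04-25 14:28:18</span>
</div>
<div class="list_item clearfix">
  <h2><a href="http://example.com/b.html">Second title</a></h2>
  <span class="time">2016-04-22 15:00:23</span>
</div>
'''

# Same pattern as above, with named groups instead of numbered ones.
ITEM_PATTERN = re.compile(
    r'<div class="list_item clearfix">.*?'
    r'<h2><a href="(?P<url>.*?)">(?P<title>.*?)</a></h2>.*?'
    r'<span class="time">(?P<created_at>.*?)</span>',
    re.S,  # let '.' match newlines, since each item spans several lines
)


def parse_items(html):
    """Return one dict (url, title, created_at) per item found in html."""
    return [m.groupdict() for m in ITEM_PATTERN.finditer(html)]


items = parse_items(SAMPLE_HTML)
print(items)
```

Keeping the parsing in a function that takes a string makes it possible to check the pattern on saved or hand-made HTML before pointing it at the live page.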