Scrapy in Practice 1: Distributed Crawling of Youyuan (youyuan.com)

阿新 (A Xin) • Published: 2017-06-11

Straight to the code:

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class YouyuanwangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Profile avatar URL
    header_url = scrapy.Field()
    # Username
    username = scrapy.Field()
    # Personal monologue
    monologue = scrapy.Field()
    # Photo album image URLs
    pic_urls = scrapy.Field()
    # Place of origin
    place_from = scrapy.Field()
    # Education
    education = scrapy.Field()
    # Age
    age = scrapy.Field()
    # Height
    height = scrapy.Field()
    # Salary
    salary = scrapy.Field()
    # Hobbies
    hobby = scrapy.Field()
    # Source website: youyuan
    source = scrapy.Field()
    # Source URL of the profile page
    source_url = scrapy.Field()
    # Spider name
    spider = scrapy.Field()


spiders > youyuan.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
# from scrapy_redis.spiders import RedisCrawlSpider
from youyuanwang.items import YouyuanwangItem


class YouyuanSpider(CrawlSpider):
# class YouyuanSpider(RedisCrawlSpider):
    name = 'youyuan'
    allowed_domains = ['www.youyuan.com']
    # Youyuan list (search results) page
    start_urls = ['http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p1/']
    # redis_key = 'YouyuanSpider:start_urls'
    # Obtain the allowed domains dynamically
    # def __init__(self, *args, **kwargs):
    #     # Dynamically define the allowed domains list.
    #     domain = kwargs.pop('domain', '')
    #     self.allowed_domains = filter(None, domain.split(','))
    #     super(YouyuanSpider, self).__init__(*args, **kwargs)
    # Match the whole country
    # list_page = LinkExtractor(allow=(r'http://www.youyuan.com/find/.+'))
    # Search-page rule: only Beijing, ages 18-25, female; links are extracted from the response
    page_links = LinkExtractor(allow=r"http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p\d+/")
    # Profile-page rule; links are extracted from the response
    profile_page = LinkExtractor(allow=r"http://www.youyuan.com/\d+-profile/")

    rules = (
        # List pages that match are followed; they only serve as a springboard
        Rule(page_links),
        # Profile links become requests that are stored in Redis to await scheduling;
        # each response is handled by the parse_profile_page() callback and is not followed further
        Rule(profile_page, callback="parse_profile_page", follow=False)
    )

    # Process a profile page and collect the data we want
    def parse_profile_page(self, response):
        item = YouyuanwangItem()
        # Profile avatar URL
        item['header_url'] = self.get_header_url(response)
        # Username
        item['username'] = self.get_username(response)
        # Place of origin
        item['place_from'] = self.get_place_from(response)
        # Education
        item['education'] = self.get_education(response)
        # Age
        item['age'] = self.get_age(response)
        # Height
        item['height'] = self.get_height(response)
        # Salary
        item['salary'] = self.get_salary(response)
        # Hobbies
        item['hobby'] = self.get_hobby(response)
        # Photo album image URLs
        item['pic_urls'] = self.get_pic_urls(response)
        # Personal monologue
        item['monologue'] = self.get_monologue(response)
        # Source URL of the profile page
        item['source_url'] = response.url
        # Source website: youyuan
        item['source'] = "youyuan"
        # Spider name
        item['spider'] = "youyuan"
        yield item

    # Extract the avatar URL
    def get_header_url(self, response):
        # NOTE: the tail of this XPath was mangled in the source ("[email protected]");
        # "//img/@src" is an assumed reconstruction, not the author's confirmed path.
        header = response.xpath('//dl[@class="personal_cen"]//img/@src').extract()
        if len(header):
            header_url = header[0]
        else:
            header_url = ""
        return header_url.strip()

    # Extract the username
    def get_username(self, response):
        username = response.xpath('//dl[@class="personal_cen"]/dd//div[@class="main"]/strong/text()').extract()
        if len(username):
            username = username[0]
        else:
            username = ""
        return username.strip()

    # Extract the age
    def get_age(self, response):
        age = response.xpath('//dl[@class="personal_cen"]//p[@class="local"]/text()').extract()
        # The age is the second whitespace-separated token; guard against shorter text
        parts = age[0].split() if len(age) else []
        if len(parts) > 1:
            age = parts[1]
        else:
            age = ""
        return age

    # Extract the height
    def get_height(self, response):
        height = response.xpath('//div[@class="pre_data"]/ul/li[2]/div/ol[2]/li[2]/span/text()').extract()
        if len(height):
            height = height[0]
        else:
            height = ""
        return height.strip()

    # Extract the salary
    def get_salary(self, response):
        salary = response.xpath('//div[@class="pre_data"]/ul/li[2]/div/ol[1]/li[4]/span/text()').extract()
        if len(salary):
            salary = salary[0]
        else:
            salary = ""
        return salary.strip()

    # Extract the hobbies
    def get_hobby(self, response):
        hobby = response.xpath('//dl[@class="personal_cen"]//ol[@class="hoby"]//li/text()').extract()
        if len(hobby):
            hobby = ",".join(hobby).replace(" ", "")
        else:
            hobby = ""
        return hobby.strip()

    # Extract the photo album images
    def get_pic_urls(self, response):
        # NOTE: the tail of this XPath was mangled in the source ("[email protected]");
        # "//img/@src" is an assumed reconstruction, not the author's confirmed path.
        pic_urls = response.xpath('//div[@class="ph_show"]//img/@src').extract()
        if len(pic_urls):
            # Convert the list of album URLs into a single string
            pic_urls = ",".join(pic_urls)
        else:
            pic_urls = ""
        return pic_urls

    # Extract the personal monologue
    def get_monologue(self, response):
        monologue = response.xpath('//div[@class="pre_data"]/ul/li/p/text()').extract()
        if len(monologue):
            monologue = monologue[0]
        else:
            monologue = ""
        return monologue.strip()

    # Extract the place of origin
    def get_place_from(self, response):
        place_from = response.xpath('//div[@class="pre_data"]/ul/li[2]/div/ol[1]/li[1]/span/text()').extract()
        if len(place_from):
            place_from = place_from[0]
        else:
            place_from = ""
        return place_from.strip()

    # Extract the education
    def get_education(self, response):
        education = response.xpath('//div[@class="pre_data"]/ul/li[2]/div/ol[1]/li[3]/span/text()').extract()
        if len(education):
            education = education[0]
        else:
            education = ""
        return education.strip()
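
To see how the two LinkExtractor patterns split the crawl between list pages and profile pages, here is a minimal sketch that uses only the standard `re` module. It assumes the `allow` patterns are applied as regular-expression searches against each absolute URL; the `classify` helper and the sample URLs (including the profile ID) are made up for illustration and are not part of the spider.

```python
import re

# Patterns copied from the spider's two LinkExtractor definitions.
LIST_PAGE = re.compile(
    r"http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p\d+/"
)
PROFILE_PAGE = re.compile(r"http://www.youyuan.com/\d+-profile/")


def classify(url):
    """Return which rule (if any) a URL would fall under (hypothetical helper)."""
    if LIST_PAGE.search(url):
        return "list"      # followed as a springboard, no callback
    if PROFILE_PAGE.search(url):
        return "profile"   # handed to parse_profile_page, not followed
    return None            # ignored by both rules


# Hypothetical sample URLs for illustration only.
for url in (
    "http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p3/",
    "http://www.youyuan.com/12345678-profile/",
    "http://www.youyuan.com/about/",
):
    print(url, "->", classify(url))
```

Checking a handful of real URLs this way before a long crawl is a cheap way to confirm that the patterns are neither too strict nor too loose.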


pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
# import json
#
# class YouyuanwangPipeline(object):
#     def __init__(self):
#         self.filename = open("youyuanwang.json", "wb")
#     def process_item(self, item, spider):
#         jsontext = json.dumps(dict(item), ensure_ascii=False) + ",\n"
#         self.filename.write(jsontext.encode("utf-8"))
#         return item
#     def close_spider(self, spider):
#         self.filename.close()

import pymysql
from .models.es_types import YouyuanType


class XiciPipeline(object):
    # Write each item into the MySQL table youyuanwang
    def process_item(self, item, spider):
        # DBKWARGS = spider.settings.get('DBKWARGS')
        con = pymysql.connect(host='127.0.0.1', user='root', passwd='229801', db='yunyuan', charset='utf8')
        cur = con.cursor()
        sql = ("insert into youyuanwang(header_url,username,monologue,pic_urls,place_from,education,age,height,salary,hobby,source) "
               "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        lis = (item['header_url'], item['username'], item['monologue'], item['pic_urls'], item['place_from'],
               item['education'], item['age'], item['height'], item['salary'], item['hobby'], item['source'])

        cur.execute(sql, lis)
        con.commit()
        cur.close()
        con.close()
        return item


class ElasticsearchPipeline(object):
    # Copy the item fields onto YouyuanType (defined in models/es_types.py, not shown) and save it
    def process_item(self, item, spider):
        youyuan = YouyuanType()
        youyuan.header_url = item["header_url"]
        youyuan.username = item["username"]
        youyuan.age = item["age"]
        youyuan.salary = item["salary"]
        youyuan.monologue = item["monologue"]
        youyuan.pic_urls = item["pic_urls"]
        youyuan.place_from = item["place_from"]

        youyuan.save()

        return item
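
XiciPipeline above opens and closes a database connection for every item. A common alternative is to open the connection once in `open_spider` and close it in `close_spider`, the hooks Scrapy calls at the start and end of a crawl. The sketch below illustrates that shape only: it swaps MySQL for the standard-library `sqlite3` module (in-memory) so it runs anywhere, which also means `?` placeholders instead of `%s`. The class name `SqlitePipelineSketch` and the `COLUMNS` tuple are illustrative assumptions, not the author's code.

```python
import sqlite3

# Column order taken from the INSERT statement in XiciPipeline.
COLUMNS = ("header_url", "username", "monologue", "pic_urls", "place_from",
           "education", "age", "height", "salary", "hobby", "source")


class SqlitePipelineSketch:
    """Hypothetical pipeline: one connection per crawl, sqlite3 as a stand-in for MySQL."""

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE youyuanwang (%s)" % ", ".join(COLUMNS))

    def process_item(self, item, spider):
        # Parameterized insert; missing fields fall back to an empty string.
        placeholders = ", ".join("?" for _ in COLUMNS)
        sql = "INSERT INTO youyuanwang (%s) VALUES (%s)" % (", ".join(COLUMNS), placeholders)
        self.con.execute(sql, [item.get(col, "") for col in COLUMNS])
        self.con.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.con.close()


# Usage outside Scrapy, with a plain dict standing in for an item.
pipeline = SqlitePipelineSketch()
pipeline.open_spider(None)
pipeline.process_item({"username": "demo", "age": "22"}, None)
rows = pipeline.con.execute("SELECT username, age, hobby FROM youyuanwang").fetchall()
print(rows)
pipeline.close_spider(None)
```

With pymysql the same structure would apply, keeping the `%s` placeholders from the original statement.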


settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for youyuanwang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'youyuanwang'

SPIDER_MODULES = ['youyuanwang.spiders']
NEWSPIDER_MODULE = 'youyuanwang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'youyuanwang (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'youyuanwang.middlewares.YouyuanwangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'youyuanwang.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
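#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Use scrapy-redis for request de-duplication and scheduling,
# and keep the Redis queues between runs
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True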

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   #'youyuanwang.pipelines.XiciPipeline': 300,
   'youyuanwang.pipelines.ElasticsearchPipeline': 300,
   # 'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
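
The `DUPEFILTER_CLASS` setting is what lets several crawler processes share one "already seen" set, so the same profile is not fetched twice. As a rough, simplified sketch of the idea: each request is reduced to a fingerprint and checked against a shared set. Here a Python `set` stands in for the Redis set, and the fingerprint hashes only the method and URL. The real fingerprint covers more of the request, and the class and method names below are illustrative assumptions, not the scrapy-redis implementation.

```python
import hashlib


class SeenFilterSketch:
    """Hypothetical, simplified duplicate filter; a set stands in for Redis."""

    def __init__(self):
        # In the distributed setup this set would live in Redis,
        # shared by every crawler process.
        self.fingerprints = set()

    def fingerprint(self, method, url):
        # Simplified: hash only the HTTP method and the URL.
        digest = hashlib.sha1()
        digest.update(method.upper().encode("utf-8"))
        digest.update(url.encode("utf-8"))
        return digest.hexdigest()

    def request_seen(self, method, url):
        # Return True if this request was already recorded, else record it.
        fp = self.fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


seen = SeenFilterSketch()
url = "http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p1/"
print(seen.request_seen("GET", url))  # first time: not seen yet
print(seen.request_seen("get", url))  # same request again: seen
```

With `SCHEDULER_PERSIST = True`, the request queue and the seen-set are kept in Redis after the spider stops, so a later run can resume instead of starting over.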


Saving from Redis to MongoDB: create a new file process_item_mongo.py in the project directory (the name is arbitrary)

# coding=utf-8

import json

import pymongo
import redis


def process_item():
    # Connect to Redis (source) and MongoDB (destination)
    redis_conn = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
    mongo_conn = pymongo.MongoClient(host='127.0.0.1', port=27017)
    db = mongo_conn["youyuan"]
    table = db["beijing_18_25"]
    while True:
        # Block until an item is available on the "youyuan:items" list
        source, data = redis_conn.blpop(["youyuan:items"])
        data = json.loads(data.decode("utf-8"))
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        table.insert_one(data)


if __name__ == "__main__":
    process_item()
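
The script above expects JSON-encoded items on the Redis list `youyuan:items`. That is where scrapy-redis's `RedisPipeline` puts them by default, so the `RedisPipeline` entry that is commented out in settings.py would presumably need to be enabled first. The decode-and-store step can be tried without Redis or MongoDB; in this sketch a `deque` stands in for the Redis list and a plain list for the collection. `decode_item` and `drain` are hypothetical helper names.

```python
import json
from collections import deque


def decode_item(raw):
    """Turn one raw queue entry (UTF-8 encoded JSON bytes) into a dict."""
    return json.loads(raw.decode("utf-8"))


def drain(queue, sink):
    """Pop entries from the left until the queue is empty; return how many were stored."""
    count = 0
    while queue:
        sink.append(decode_item(queue.popleft()))
        count += 1
    return count


# A deque stands in for the Redis list, a list for the MongoDB collection.
fake_queue = deque([
    json.dumps({"username": "demo", "age": "22"}).encode("utf-8"),
])
stored = []
print(drain(fake_queue, stored), stored)
```

Unlike this sketch, the real loop never ends: `blpop` blocks until the next item arrives, so the script runs as a long-lived consumer next to the crawlers.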


Saving from Redis to MySQL: create a new file process_item_mysql.py in the project directory (the name is arbitrary)

# coding=utf-8

import json

import pymysql
import redis


def process_item():
    redis_conn = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
    # charset matches the connection used in pipelines.py
    mysql_conn = pymysql.connect(host='127.0.0.1', user='root', passwd='229801', port=3306,
                                 db='yunyuan', charset='utf8')
    sql = ("insert into youyuanwang(header_url,username,monologue,pic_urls,place_from,education,age,height,salary,hobby,source) "
           "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
    try:
        while True:
            # Block until an item is available on the "youyuan:items" list
            source, data = redis_conn.blpop("youyuan:items")
            data = json.loads(data.decode("utf-8"))
            lis = (data['header_url'], data['username'], data['monologue'], data['pic_urls'], data['place_from'],
                   data['education'], data['age'], data['height'], data['salary'], data['hobby'], data['source'])
            cur = mysql_conn.cursor()
            cur.execute(sql, lis)
            mysql_conn.commit()
            cur.close()
    finally:
        # Close the connection once, when the loop ends, not after every item
        mysql_conn.close()


# Must be at module level; inside the function it would never run
if __name__ == "__main__":
    process_item()


Data:

[Screenshot of the crawled data; image not preserved]

Disclaimer: the above is for reference, study, and discussion only!
