Building a Baidu/Google-style search engine with Nutch

阿新 • Published: 2019-01-14

Nutch is a search engine built on Lucene. It includes full-text search and a web crawler. Lucene provides Nutch with the text indexing and search API.

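Which tool to use depends on where your data lives:
1. You already have a data source and need to offer a search page for it. The best approach is to read the data straight from the database and index it with the Lucene API, because nothing has to be crawled from other sites.
2. You have no local data source, or the data is widely scattered, so you have to crawl other people's sites. In that case use Nutch.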

1. Installation

1. Install Tomcat

[root@localhost ~]# wget https://archive.apache.org/dist/tomcat/tomcat-9/v9.0.1/bin/apache-tomcat-9.0.1.tar.gz 

[root@localhost ~]# tar xvzf apache-tomcat-9.0.1.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv apache-tomcat-9.0.1/ tomcat
[root@localhost local]# /usr/local/tomcat/bin/startup.sh

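2. Deploy Nutch
Here I use Nutch 1.2. Much newer versions exist, but releases after 1.2 no longer ship a war package, so they cannot serve a Baidu-style search page; Nutch instead moved to feeding Solr, which handles the search.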

[root@localhost ~]# wget http://archive.apache.org/dist/nutch/apache-nutch-1.2-bin.tar.gz
[root@localhost ~]# tar xvf apache-nutch-1.2-bin.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv nutch-1.2/ nutch
[root@localhost local]# cd nutch/
[root@localhost nutch]# cp nutch-1.2.war /usr/local/tomcat/webapps/nutch.war

Tomcat unpacks and deploys the Nutch war package automatically. Open http://localhost:8080/nutch in a browser and the search page appears.

2. Crawling data

In the nutch directory, create a file named url.txt and put the site we want to crawl in it:

https://www.hicool.top/

There is a filter rule: the site entered in the previous step is only crawled if it passes this filter. To change the rule, open conf/crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

This is a regular expression. Change the line that starts with the plus sign so that only your own site gets through:

+^https?://([a-z0-9]*\.)*hicool\.top/
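
A rule like this is easy to check before starting a long crawl. The sketch below tests the seed URL against two variants of the pattern with Python's re module. It assumes the filter applies each regex as an unanchored search, and the helper name accepts is made up for this illustration. The point to watch is the scheme: the seed in url.txt is an https URL, so a rule beginning with ^http:// would reject it, while https? allows both schemes.

```python
import re

def accepts(rule: str, url: str) -> bool:
    """Return True if the regex of a '+' rule is found in the URL."""
    assert rule.startswith("+"), "this sketch only handles accept rules"
    return re.search(rule[1:], url) is not None

# Two variants of the accept rule (the second also escapes the dot).
http_only = r"+^http://([a-z0-9]*\.)*hicool.top/"
http_or_https = r"+^https?://([a-z0-9]*\.)*hicool\.top/"

seed = "https://www.hicool.top/"
print(accepts(http_only, seed))       # False: "http:" does not match "https:"
print(accepts(http_or_https, seed))   # True
```

If the seed is rejected by every rule, the crawl has nothing to fetch at depth 0, so this check is worth doing first.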

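Now only your own site will be crawled. Next, edit conf/nutch-site.xml and add the following properties inside the configuration tag. searcher.dir is the index directory property, which tells the searcher where to read its data. You also need an http.agent.name property and an http.robots.agents property, otherwise nothing can be crawled: Nutch follows the robots protocol, and while crawling a site it sends its own identity to that site so that it can be recognized.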

<property>
    <name>http.agent.name</name>
    <value>hicool.top</value>
    <description>Hello, welcome to visit www.hicool.top</description>
</property>
<property>
    <name>http.robots.agents</name>
    <value>hicool.top,*</value>
</property>
<property>
    <name>searcher.dir</name>
    <value>/usr/local/nutch/crawl</value>
    <description></description>
</property>

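searcher.dir specifies the directory where the crawl results are stored. The value of http.agent.name can be anything you like, but http.robots.agents must list your http.agent.name value first, otherwise you get the error "Your 'http.agent.name' value should be listed first in 'http.robots.agents' property".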
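
The constraint between the two agent properties can be checked mechanically. The following is a minimal sketch, not part of Nutch: it parses a nutch-site.xml style document with Python's standard xml.etree module and verifies that the http.agent.name value is the first entry of http.robots.agents. The function names and the inline sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text: str) -> dict:
    """Collect <property> name/value pairs from a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): (p.findtext("value") or "")
            for p in root.iter("property")}

def agent_listed_first(props: dict) -> bool:
    """True if http.agent.name is non-empty and is the first robots agent."""
    agent = props.get("http.agent.name", "").strip()
    robots = [a.strip() for a in props.get("http.robots.agents", "").split(",")]
    return bool(agent) and robots[0] == agent

SAMPLE = """<configuration>
  <property><name>http.agent.name</name><value>hicool.top</value></property>
  <property><name>http.robots.agents</name><value>hicool.top,*</value></property>
</configuration>"""

print(agent_listed_first(read_properties(SAMPLE)))  # True
```

To check a real file, read its contents and pass the text to read_properties instead of SAMPLE.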

Note: crawling https sites is not enabled by default. To enable it, add the following to nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-jsoup</value>
</property>

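This makes Nutch download https pages with the protocol-httpclient plugin; the other plugins filter and parse pages. Once the plugin is added, https sites can be crawled. The supported protocols and the plugins behind them are: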

http:
    protocol-http
    protocol-httpclient
https:
    protocol-httpclient
ftp:
    protocol-ftp
file:
    protocol-file

Nutch can crawl in two ways:
• Intranet crawling. Targets a small number of sites and uses the crawl command.
• Whole-web crawling. Uses the lower-level inject, generate, fetch and updatedb commands, which give finer control.

We use the crawl command to fetch the data:

[root@localhost nutch]# bin/nutch crawl url.txt -dir crawl -depth 10 -topN 100
crawl started in: crawl
rootUrlDir = url.txt
threads = 10
......
......
......
IndexMerger: merging indexes to: crawl/index
Adding file:/usr/local/nutch/crawl/indexes/part-00000
IndexMerger: finished at 2017-10-19 19:59:50, elapsed: 00:00:01
crawl finished: crawl

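The output is long, so I have left most of it out. The parameters mean the following:
-dir specifies the directory where the crawl results are stored; this run writes them to the crawl directory;
-depth is the link depth to crawl; this run goes 10 levels deep;
-topN means only the top N URLs are fetched; this run takes the top 100 pages at each level;
-threads sets the number of download threads used by the crawl; I could never fetch any data when I passed it, so I removed it.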

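The output shows how Nutch crawls pages and builds the index:
1) The Injector adds the starting root URLs to the web database;
2) For each requested level, the Generator produces the list of pages to download;
3) The Fetcher downloads those pages with the specified number of threads;
4) The page parser (ParseSegment) analyzes the downloaded content;
5) The web database tool (CrawlDb) adds the newly discovered links to the database to await download;
6) The link analysis tool (LinkDb) builds the inverted links;
7) The Indexer builds the current index from the web database, the link database and the downloaded page content;
8) The duplicate remover (DeleteDuplicates) deletes duplicate entries;
9) The index merger (IndexMerger) merges the data into the existing index.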
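
The inject, generate, fetch and update steps form a loop that runs once per depth level. The toy model below illustrates how -depth and -topN bound that loop; it is a sketch over an in-memory link graph, not Nutch code. The names crawl and LINKS are hypothetical, and the batch is taken in insertion order, whereas Nutch ranks candidates by score.

```python
def crawl(seeds, links, depth, top_n):
    """Toy generate/fetch/update loop; returns the batch fetched in each round."""
    crawldb = {url: "unfetched" for url in seeds}          # inject
    rounds = []
    for _ in range(depth):
        # generate: pick at most top_n URLs that have not been fetched yet
        batch = [u for u, s in crawldb.items() if s == "unfetched"][:top_n]
        if not batch:
            break                                          # no more URLs to fetch
        for url in batch:                                  # fetch and parse
            crawldb[url] = "fetched"
            for out in links.get(url, []):                 # update: add new outlinks
                crawldb.setdefault(out, "unfetched")
        rounds.append(batch)
    return rounds

# Hypothetical site structure: page -> outgoing links.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/"],
}

print(crawl(["/"], LINKS, depth=10, top_n=100))
# [['/'], ['/a', '/b'], ['/c']]
```

With an empty seed list the loop stops before the first round, which is the situation the "Stopping at depth=0 - no more URLs to fetch" message describes.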

Test the search locally with the keyword "1":

[root@localhost nutch]# bin/nutch org.apache.nutch.searcher.NutchBean 1
Total hits: 193
 0 20171019203949/https://www.hicool.top/
 ... Liberalman 的主頁 ...
 ......
 ......

The search found 193 hits; I have omitted the rest of the output.

Use the readdb tool to print summary statistics:

[root@localhost nutch]# bin/nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     296
retry 0:        286
retry 1:        10
min score:      0.0
avg score:      0.009496622
max score:      1.11
status 1 (db_unfetched):        18
status 2 (db_fetched):  275
status 4 (db_redir_temp):       3
CrawlDb statistics: done

The CrawlDb holds 296 URLs in total, 275 of which were fetched.
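
If you want to track these numbers across crawls, the statistics text is simple to parse. The sketch below reads the total and the per-status counts from output shaped like the listing above and checks that the status counts add up to the total. The parse_stats helper is hypothetical and assumes the line layout shown here.

```python
import re

# Excerpt of the readdb -stats output shown above.
STATS = """TOTAL urls:     296
retry 0:        286
retry 1:        10
status 1 (db_unfetched):        18
status 2 (db_fetched):  275
status 4 (db_redir_temp):       3
"""

def parse_stats(text: str) -> dict:
    """Extract the URL total and the per-status counts from the stats text."""
    total = int(re.search(r"TOTAL urls:\s+(\d+)", text).group(1))
    statuses = {name: int(count)
                for name, count in re.findall(r"status \d+ \((\w+)\):\s+(\d+)", text)}
    return {"total": total, "statuses": statuses}

stats = parse_stats(STATS)
print(stats["statuses"]["db_fetched"])                     # 275
print(sum(stats["statuses"].values()) == stats["total"])   # True
```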

3. Showing search results on the web page

Edit /usr/local/tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml:

<property>
    <name>http.agent.name</name>
    <value>hicool.top</value>
    <description>Hello, welcome to visit www.hicool.top</description>
</property>
<property>
    <name>http.robots.agents</name>
    <value>hicool.top,*</value>
</property>
<property>
    <name>searcher.dir</name>
    <value>/usr/local/nutch/crawl</value>
    <description></description>
</property>

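This points the Tomcat web app at the directory where the previous step stored the crawled data. Restart Tomcat and you can search from the browser.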

4. Filtering links

Some links we want to crawl and others we want to exclude. How do we set up a filter that drops the redundant ones?

Edit conf/regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
# filter out file:, ftp: and other links that are not HTTP
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# filter out links to images and similar formats
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# filter out links containing special characters; relax this rule to crawl more links, e.g. those containing ? and =
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# filter out links with certain special formats
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# accept all links; change this to accept only the kinds of links you want
+.

# or, in place of "+." above, accept only article pages
+^https:\/\/www\.hicool\.top\/article\/.*$

To exclude certain paths from the crawl:

-^https:\/\/www\.hicool\.top\/category/.*$
+^https:\/\/www\.hicool\.top\/article\/.*$

This excludes everything under /category/. The regular expressions are tried in order, and filter() returns a result as soon as one of them matches.
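
Because the first matching rule decides, the order of the lines matters. The sketch below mimics that first-match behaviour in Python; it is an illustration rather than Nutch's own filter class, and it assumes that a URL matching no rule is rejected. The names parse_rules and url_allowed are made up for this example.

```python
import re

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def url_allowed(rules, url):
    """The first rule whose pattern is found in the URL decides the outcome."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False                          # assumption: no match means reject

RULES = parse_rules(r"""
-^(file|ftp|mailto):
-^https://www\.hicool\.top/category/.*$
+^https://www\.hicool\.top/article/.*$
""")

print(url_allowed(RULES, "https://www.hicool.top/article/1"))    # True
print(url_allowed(RULES, "https://www.hicool.top/category/go"))  # False

# With "+." first, the exclusion below it is never reached.
loose = parse_rules("+.\n" + r"-^https://www\.hicool\.top/category/.*$")
print(url_allowed(loose, "https://www.hicool.top/category/go"))  # True
```

The last case shows why an exclusion has to sit above any catch-all "+." line.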

Crawling dynamic content

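URLs we visit every day often contain a "?" followed by parameters. Dynamic content like this is not crawled by default either and has to be enabled.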

Two files under conf are involved: regex-urlfilter.txt and crawl-urlfilter.txt

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]  (change - to +)

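This rule skips any link containing ? * ! @ or =. Since it is active by default, dynamic pages, whose URLs contain ?, normally cannot be crawled. Comment the rule out in both files: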

# -[?*!@=]

Then add a line that allows them:

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

This lets the crawler fetch links containing the three characters ? = &.
Note: both files must be changed, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt

5. Splitting by word and Chinese word segmentation

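Look at the results so far and you will notice that the search splits the query into single characters: type a sentence and every character is searched on its own, which returns far too many unrelated results. The reason is that Nutch by default splits Chinese by character rather than by word.
To split by word and cut down the noise, there are two options:
1. Modify the source. Change Nutch's tokenization class directly so that it calls an existing segmentation component.
2. Use a segmentation plugin. Write a new Chinese word segmentation plugin, or add an existing one, following Nutch's plugin rules.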

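Here I take the source-modification route, which means downloading the source and recompiling. I found the IKAnalyzer3.2.8.jar package by searching online. See https://github.com/wks/ik-analyzer for how to install it.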

[root@localhost ~]# wget http://archive.apache.org/dist/nutch/apache-nutch-1.2-src.tar.gz
[root@localhost ~]# tar xvf apache-nutch-1.2-src.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv apache-nutch-1.2/ nutch
[root@localhost local]# cd nutch
[root@localhost nutch]# mv ~/IKAnalyzer3.2.8.jar lib/

Edit the grammar file from which the source is generated, src/java/org/apache/nutch/analysis/NutchAnalysis.jj:

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

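This splits text character by character. Change it to | <SIGRAM: (<CJK>)+ >. The trailing "+" means one or more repetitions, so consecutive characters are grouped into a single token.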
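
The effect of the "+" can be seen with an ordinary regular expression. The sketch below compares matching one CJK character at a time with matching a run of them. It assumes a simplified CJK range (U+4E00 to U+9FFF), which is narrower than whatever the grammar's CJK definition covers, and the tokens helper is made up for this illustration.

```python
import re

# Assumption: a simplified CJK character class, for illustration only.
CJK = r"[\u4e00-\u9fff]"

def tokens(text: str, per_char: bool):
    """Tokenize CJK text either one character at a time or as whole runs."""
    pattern = CJK if per_char else CJK + "+"
    return re.findall(pattern, text)

print(tokens("設計模式 in Java", per_char=True))    # ['設', '計', '模', '式']
print(tokens("設計模式 in Java", per_char=False))   # ['設計模式']
```

Grouping a run into one token is not yet dictionary-based segmentation: the run "設計模式" is not broken into "設計" and "模式". That finer split is presumably the job of the IKAnalyzer library wired in below.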

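The Java source is generated automatically from these rules by JavaCC, the Java parser generator that Lucene also uses. Make sure the tool is installed.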

[root@localhost nutch]# cd src/java/org/apache/nutch/analysis/
[root@localhost analysis]# javacc NutchAnalysis.jj

The newly generated source files in the current directory overwrite the old ones.

Edit NutchAnalysis.java:

  /** Construct a query parser for the text in a reader. */
  public static Query parseQuery(String queryString, Configuration conf) throws IOException,ParseException {
    return parseQuery(queryString, null, conf);
  }

  /** Construct a query parser for the text in a reader. */
  public static Query parseQuery(String queryString, Analyzer analyzer, Configuration conf)
    throws IOException,ParseException {
    NutchAnalysis parser = new NutchAnalysis(
          queryString, (analyzer != null) ? analyzer : new NutchDocumentAnalyzer(conf));
    parser.queryString = queryString;
    parser.queryFilters = new QueryFilters(conf);
    return parser.parse(conf);
  }

The original code does not declare ParseException. Add ",ParseException" after IOException in both signatures; the listing above is the version after my change.

Edit NutchDocumentAnalyzer.java:

  /** Returns a new token stream for text from the named field. */
  public TokenStream tokenStream(String fieldName, Reader reader) {
    /*Analyzer analyzer;
    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;*/
    Analyzer analyzer = new org.wltea.analyzer.lucene.IKAnalyzer();

    return analyzer.tokenStream(fieldName, reader);
  }

I commented out the original code; the line just before the return is the new one.

Go back to the root directory and edit build.xml. Inside the <lib></lib> element of <target name="war" depends="jar,compile,generate-docs"></target>, add IKAnalyzer3.2.8.jar so that the build can depend on it.

        <include name="log4j-*.jar"/>
        <include name="IKAnalyzer3.2.8.jar"/>
      </lib>

Start the build:

[root@localhost nutch]# ant

The build succeeds and produces a build directory:

[root@localhost nutch]# cp build/nutch-1.2.job ./

Then build the war package:

[root@localhost nutch]# ant war
[root@localhost nutch]# cp build/nutch-1.2.jar ./
[root@localhost nutch]# cp build/nutch-1.2.war ./

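That completes the build. What remains is to repeat the deployment steps described above, which I skip here. One thing to note: on the new search page, entering a keyword and searching produces a blank page. You also need to edit /usr/local/tomcat/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml and add the property that loads the plugins: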

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Note that this uses protocol-http rather than protocol-httpclient. After a restart, the word segmentation takes effect.

You can see that the search now splits the query into keywords such as "設計模式", "設計" and "模式". OK, it works!

Problems

After every re-crawl, Tomcat has to be restarted before the search works properly.

1. Stopping at depth=0 - no more URLs to fetch

Annoyingly, plenty of posts online give the command like this:

bin/nutch crawl url.txt -dir crawl -depth 10 -topN 100 -treads 10

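I copied it as-is and kept getting the error "Stopping at depth=0 - no more URLs to fetch". I searched for every fix I could find and changed one thing after another, all to no avail. I even validated the filter regular expressions elsewhere, and they were fine. I switched between versions 0.9 and 1.2 countless times and configured a pile of settings. In the end I found out for myself that the -treads 10 argument was the culprit: with it the crawl always failed, and as soon as I removed it everything worked. Note that the flag in the copied command is spelled -treads, not -threads, so the misspelling itself is the likely cause.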
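
Why would the misspelled -treads flag lead to "no more URLs to fetch" instead of a usage error? One plausible mechanism is a parser that treats every argument it does not recognize as the seed path. The sketch below is a hypothetical parser written to show that effect; it is an assumption about how the crawl command handles its arguments, not a copy of Nutch's code, and the default values are illustrative.

```python
def parse_crawl_args(args):
    """Hypothetical parser: known flags take a value, anything else is the seed path."""
    opts = {"seeds": None, "dir": None, "threads": 10, "depth": 5, "topN": None}
    i = 0
    while i < len(args):
        if args[i] in ("-dir", "-threads", "-depth", "-topN"):
            key = args[i].lstrip("-")
            value = args[i + 1]
            opts[key] = value if key == "dir" else int(value)
            i += 2
        else:
            opts["seeds"] = args[i]    # an unrecognized argument overwrites the seed path
            i += 1
    return opts

good = "url.txt -dir crawl -depth 10 -topN 100 -threads 10".split()
bad = "url.txt -dir crawl -depth 10 -topN 100 -treads 10".split()

print(parse_crawl_args(good)["seeds"])   # url.txt
print(parse_crawl_args(bad)["seeds"])    # 10
```

Under that assumption the crawl would look for seed URLs in a path named "10", find none, and stop at depth 0.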

2. Garbled Chinese characters

Edit server.xml in Tomcat's conf directory as follows:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

Find this block and add URIEncoding="UTF-8" useBodyEncodingForURI="true".
Then restart Tomcat.
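
The garbling comes from decoding the query string with the wrong character set. The short sketch below, using only Python's standard urllib.parse, shows a Chinese keyword percent-encoded as UTF-8 and then decoded two ways; the keyword is just an example value.

```python
from urllib.parse import quote, unquote

keyword = "設計模式"
encoded = quote(keyword)   # percent-encoded UTF-8 bytes, as sent in a URL

# Decoding with the same charset restores the keyword.
print(unquote(encoded, encoding="utf-8") == keyword)        # True
# Decoding the same bytes as ISO-8859-1 yields garbled text instead.
print(unquote(encoded, encoding="iso-8859-1") == keyword)   # False
```

Setting URIEncoding="UTF-8" on the connector asks Tomcat to take the first path when it decodes the request URI.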
