Elasticsearch Analyzers: A Comprehensive Comparison and Usage Guide

阿新 • Published: 2018-11-05

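Introduction: Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is written in Java, was released as open source under the Apache license, and is a popular enterprise search engine. It is designed for cloud environments and offers near-real-time search that is stable, reliable, fast, and easy to install and use.

Elasticsearch ships with many built-in analyzers. The sections below compare the built-in analyzers with the commonly used Chinese analyzers.

Built-in analyzers:

1. standard analyzer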

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

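How to use: http://www.yiibai.com/lucene/lucene_standardanalyzer.html

For English it behaves much like StopAnalyzer. Chinese is handled by splitting the text into single characters. It lowercases tokens and removes stop words and punctuation. (Stop-word removal is the default of the Lucene 3.6 StandardAnalyzer shown below; the Elasticsearch standard analyzer has stop words disabled by default.)

/** StandardAnalyzer. getTokens is a helper of the enclosing class that is not shown in this article. */
public void standardAnalyzer(String msg) {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}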

2. simple analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html

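How to use: http://www.yiibai.com/lucene/lucene_simpleanalyzer.html

More capable than WhitespaceAnalyzer: it first splits the text at every non-letter character and then lowercases the tokens. As a consequence, digits are discarded.

/** SimpleAnalyzer. getTokens is a helper of the enclosing class that is not shown in this article. */
public void simpleAnalyzer(String msg) {
    SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}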
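
To make the rule concrete, here is a minimal stand-alone sketch of the behavior just described (break at every non-letter, then lowercase). It is an illustration written for this article, not Lucene's SimpleAnalyzer; the class name SimpleAnalyzerSketch and its tokenize method are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: mimics the described behaviour, not Lucene's code.
class SimpleAnalyzerSketch {

    // Split at every non-letter character and lowercase each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (digit, space, punctuation) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, foxes]
        System.out.println(tokenize("The 2 QUICK Brown-Foxes"));
    }
}
```

In this sketch the digit "2" disappears because it is not a letter, and a run of Chinese characters would stay together as one token because every character in it counts as a letter.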

3. whitespace analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html

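How to use: http://www.yiibai.com/lucene/lucene_whitespaceanalyzer.html

It only splits on whitespace: tokens are not lowercased, Chinese text is not segmented, and no other normalization is applied to the resulting tokens.

/** WhitespaceAnalyzer. getTokens is a helper of the enclosing class that is not shown in this article. */
public void whitespaceAnalyzer(String msg) {
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}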

4. stop analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html

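How to use: http://www.yiibai.com/lucene/lucene_stopanalyzer.html

StopAnalyzer goes one step beyond SimpleAnalyzer: on top of SimpleAnalyzer's behavior it removes common English words (such as "the" and "a"). The stop-word list can be customized as needed. Chinese is not supported.

/** StopAnalyzer. getTokens is a helper of the enclosing class that is not shown in this article. */
public void stopAnalyzer(String msg) {
    StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
    this.getTokens(analyzer, msg);
}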
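
The stop-word step can be sketched in a few lines of plain Java. This is a hedged illustration, not Lucene's StopAnalyzer: the class name StopAnalyzerSketch is made up, and the five-word stop list is a tiny stand-in for a real English list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: letter-based splitting, lowercasing, stop-word removal.
class StopAnalyzerSketch {

    // Tiny illustrative stop list (an assumption, not Lucene's default list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        // Split at runs of non-letter characters.
        for (String raw : text.split("[^\\p{L}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // Skip the empty leading piece and any stop word.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [capital, france, paris]
        System.out.println(tokenize("The capital of France is Paris", STOP_WORDS));
    }
}
```

Passing a different set as the second argument corresponds to configuring your own stop words.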

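5. keyword analyzer

KeywordAnalyzer treats the entire input as a single token, which is convenient for indexing and searching special kinds of text. It is well suited to building index terms for values such as postal codes and addresses.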

6. pattern analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html

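A pattern analyzer uses a regular expression to split text into terms (the tokens that remain after the token filters have run). It accepts the following settings:

lowercase: whether terms are lowercased. Defaults to true.
pattern: the regular expression. Defaults to \W+.
flags: the regular expression flags.
stopwords: a list of stop words used to initialize the stop filter. Defaults to an empty list.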
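
As a rough illustration of the pattern and lowercase settings, the following sketch splits text with java.util.regex. It is not the Elasticsearch implementation; the class name PatternAnalyzerSketch and the tokenize signature are assumptions, and stop words and flags are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative sketch only: regex used as the token separator.
class PatternAnalyzerSketch {

    static List<String> tokenize(String text, String regex, boolean lowercase) {
        Pattern separator = Pattern.compile(regex);
        List<String> tokens = new ArrayList<>();
        for (String piece : separator.split(text)) {
            if (piece.isEmpty()) {
                continue; // a separator at the start yields an empty piece
            }
            tokens.add(lowercase ? piece.toLowerCase(Locale.ROOT) : piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Default-like settings: split on \W+ and lowercase.
        // prints [john_smith, foo, bar, com]
        System.out.println(tokenize("John_Smith@Foo-Bar.com", "\\W+", true));
        // Custom separator, case preserved.
        // prints [A, b, C]
        System.out.println(tokenize("A,b;C", "[,;]", false));
    }
}
```

Note that in Java the \W class is ASCII-based unless a Unicode flag is set, so in this sketch Chinese characters would be treated as separators rather than as token content.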

7. language analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

A set of analyzers for specific languages (arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Unfortunately there is no dedicated Chinese analyzer among them (cjk only produces character bigrams), so they are not considered further here.

8. snowball analyzer

A snowball analyzer is built from the standard tokenizer plus four filters: the standard filter, lowercase filter, stop filter, and snowball filter.

Using the snowball analyzer is generally discouraged in Lucene.

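9. custom analyzer

A custom analyzer is one you define yourself. It combines exactly one tokenizer with zero or more token filters and zero or more character filters. The name of a custom analyzer must not start with "_".

The following are settings that can be set for a custom analyzer type:

tokenizer: the logical or registered name of the tokenizer.
filter: the logical or registered names of the token filters.
char_filter: the logical or registered names of the character filters.
position_increment_gap: the position gap inserted between the values of a multi-valued field, which keeps phrase and proximity queries from matching across values. Defaults to 100.

Example template: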

index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
              my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024

10. fingerprint analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html

Chinese analyzers:

1. ik-analyzer

https://github.com/wks/ik-analyzer

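IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.

It uses its own "forward iterative finest-granularity segmentation algorithm", supports two segmentation modes (fine-grained and maximum word length), and is reported to process 830,000 characters per second (1600 KB/s).

It uses a multi-sub-processor analysis model that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.

Optimized dictionary storage gives it a smaller memory footprint. User-defined dictionary extensions are supported.

IKQueryParser, a query analyzer optimized for Lucene full-text search (strongly recommended by its author), introduces simple search expressions and uses a disambiguation algorithm to optimize how query keywords are combined, which can greatly improve Lucene's hit rate.

Maven usage: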

<dependency>
    <groupId>org.wltea.ik-analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>3.2.8</version>
</dependency>

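Until IK Analyzer is published to the Maven Central Repository, you need to install it manually, either into your local repository or by uploading it to your own Maven repository server.

To install it into the local Maven repository, run the following command, which compiles, packages, and installs it automatically: mvn install -Dmaven.test.skip=true

Adding Chinese analysis to Elasticsearch

Install the IK analysis plugin

https://github.com/medcl/elasticsearch-analysis-ik

Change into the elasticsearch-analysis-ik-master directory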

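How to install Chinese analyzers (IK + pinyin) in Elasticsearch: http://www.cnblogs.com/xing901022/p/5910139.html

Configuring and using the IK Chinese analyzer in Elasticsearch: http://blog.csdn.net/jam00/article/details/52983056

IK provides two analyzers:

ik_max_word: splits the text at the finest granularity, producing as many words as possible.

ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused by another word.

The difference: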

# ik_max_word

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '聯想是全球最大的筆記本廠商'
# Response

{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "筆記",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "本廠",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}


# ik_smart

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '聯想是全球最大的筆記本廠商'

# Response

{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

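Next, create an index named iktest that uses IK. Its settings define an analyzer named ik that uses the ik_max_word tokenizer, and its mapping creates a type named article with a subject field that is analyzed with ik_max_word.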

curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "subject" : {
                    "type" : "string",
                    "analyzer" : "ik_max_word"
                }
            }
        }
    }
}'

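Bulk-index a few documents. The _id metadata is set explicitly to make the results easier to read; the subject values are a few news headlines picked at random.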

curl -XPOST http://localhost:9200/iktest/article/_bulk?pretty -d '
{ "index" : { "_id" : "1" } }
{"subject" : "＂閨蜜＂崔順實被韓檢方傳喚 韓總統府促徹查真相" }
{ "index" : { "_id" : "2" } }
{"subject" : "韓舉行＂護國訓練＂ 青瓦臺:決不許國家安全出問題" }
{ "index" : { "_id" : "3" } }
{"subject" : "媒體稱FBI已經取得搜查令 檢視希拉里電郵" }
{ "index" : { "_id" : "4" } }
{"subject" : "村上春樹獲安徒生獎 演講中談及歐洲排外問題" }
{ "index" : { "_id" : "5" } }
{"subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”" }
'

Search for “希拉里和韓國”:

curl -XPOST http://localhost:9200/iktest/article/_search?pretty  -d'
{
    "query" : { "match" : { "subject" : "希拉里和韓國" }},
    "highlight" : {
        "pre_tags" : ["<font color='red'>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "subject" : {}
        }
    }
}
'
# Response
{
  "took" : 113,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.034062363,
    "hits" : [ {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "2",
      "_score" : 0.034062363,
      "_source" : {
        "subject" : "韓舉行＂護國訓練＂ 青瓦臺:決不許國家安全出問題"
      },
      "highlight" : {
        "subject" : [ "<font color=red>韓</font>舉行＂護<font color=red>國</font>訓練＂ 青瓦臺:決不許國家安全出問題" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "3",
      "_score" : 0.0076681254,
      "_source" : {
        "subject" : "媒體稱FBI已經取得搜查令 檢視希拉里電郵"
      },
      "highlight" : {
        "subject" : [ "媒體稱FBI已經取得搜查令 檢視<font color=red>希拉里</font>電郵" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "5",
      "_score" : 0.006709609,
      "_source" : {
        "subject" : "希拉里團隊炮轟FBI 參院民主黨領袖批其“違法”"
      },
      "highlight" : {
        "subject" : [ "<font color=red>希拉里</font>團隊炮轟FBI 參院民主黨領袖批其“違法”" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "1",
      "_score" : 0.0021509775,
      "_source" : {
        "subject" : "＂閨蜜＂崔順實被韓檢方傳喚 韓總統府促徹查真相"
      },
      "highlight" : {
        "subject" : [ "＂閨蜜＂崔順實被<font color=red>韓</font>檢方傳喚 <font color=red>韓</font>總統府促徹查真相" ]
      }
    } ]
  }
}

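The query uses the highlight option so that the result can be rendered directly in HTML, with the matched characters or words shown in red. For filter-style exact matching, change match to term; note that a term query does not analyze its input.

Hot-word update configuration

Internet slang changes daily. How can newly coined hot words (or other specific terms) be picked up by search in near real time?

First, test with IK: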

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '
成龍原名陳港生
'
# Response
{
  "tokens" : [ {
    "token" : "成龍",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "原名",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "陳",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "港",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "生",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "CN_CHAR",
    "position" : 4
  } ]
}

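IK's main dictionary does not contain the word “陳港生”, so it was split into single characters. Let's configure it.

Edit the IK configuration file: <ES directory>/plugins/ik/config/ik/IKAnalyzer.cfg.xml

Change it as follows: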

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

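I use the remote extension dictionary here because another program can update it and Elasticsearch does not need to be restarted, which is convenient. A custom mydict.dic dictionary is also easy to use: one word per line, add entries as needed.

Because the dictionary is remote, it must be served from a reachable URL. It can be a page or a plain .txt file, but the output must be UTF-8 encoded.

Contents of hotWords.php: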

<?php
$s = <<<'EOF'
陳港生
元樓
藍瘦
EOF;
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;

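IK reads two response headers, Last-Modified and ETag, and polls the URL once a minute; a change in either header triggers a dictionary reload. Restart Elasticsearch and check the startup log: the three words have been loaded.

Run the request above again and the response shows that IK now recognizes “陳港生” as a word. In the same way, company-specific proper nouns (for example 永輝, 永輝超市, 永輝雲創, 雲創, ...) can be added to the dictionary by hand.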
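
The "reload when either header changes" rule can be sketched as a small state holder. This is a hypothetical illustration of the rule as described in this article, not the IK plugin's source; the class RemoteDictChangeDetector and its shouldReload method are invented names.

```java
import java.util.Objects;

// Illustrative sketch only: remembers the last seen header values and
// reports whether a poll should trigger a dictionary reload.
class RemoteDictChangeDetector {

    private String lastModified; // last seen Last-Modified header (null before the first poll)
    private String eTag;         // last seen ETag header (null before the first poll)

    // Returns true when either header differs from the previously seen value.
    boolean shouldReload(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        if (changed) {
            lastModified = newLastModified;
            eTag = newETag;
        }
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictChangeDetector detector = new RemoteDictChangeDetector();
        String date = "Mon, 31 Oct 2016 07:00:00 GMT";
        System.out.println(detector.shouldReload(date, "\"5816f349-19\"")); // true: first poll
        System.out.println(detector.shouldReload(date, "\"5816f349-19\"")); // false: nothing changed
        System.out.println(detector.shouldReload(date, "\"5816f400-1c\"")); // true: ETag changed
    }
}
```

One consequence worth checking in your own setup: hotWords.php as written sends the current time as Last-Modified on every request, so under this rule each poll would look like a change. Sending a header that only changes when the word list changes avoids needless reloads.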

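2. jieba Chinese word segmentation

Features:

1. Three segmentation modes:

Accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis.
Full mode: scans out every word that can be formed in the sentence; very fast, but cannot resolve ambiguity.
Search-engine mode: builds on accurate mode and splits long words again to improve recall; suited to search-engine indexing.

2. Supports Traditional Chinese segmentation

3. Supports custom dictionaries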

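3. THULAC

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It performs Chinese word segmentation and part-of-speech tagging. According to its authors, THULAC has the following characteristics:

Capable. It is trained on what its authors describe as the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters), which gives the model strong tagging ability.

Accurate. On the standard Chinese Treebank (CTB5) dataset it reaches an F1 score of 97.3% for segmentation and 92.9% for part-of-speech tagging, on par with the best methods on that dataset.

Fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second; segmentation alone reaches 1.3 MB/s.

thulac4j, a Java release of the segmenter, makes the following changes:

1. Normalizes the segmentation dictionary and removes some useless words;

2. Rewrites the construction algorithm of the DAT (double-array trie), reducing the generated DAT size by about 8% and saving memory;

3. Optimizes the segmentation algorithm to increase throughput.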

<dependency>
  <groupId>io.github.yizhiru</groupId>
  <artifactId>thulac4j</artifactId>
  <version>${thulac4j.version}</version>
</dependency>

http://www.cnblogs.com/en-heng/p/6526598.html

thulac4j supports two modes:

SegOnly mode: segmentation only, without part-of-speech tags;

SegPos mode: segmentation with part-of-speech tags.

// SegOnly mode
String sentence = "滔滔的流水，向著波士頓灣無聲逝去";
SegOnly seg = new SegOnly("models/seg_only.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ，, 向著, 波士頓灣, 無聲, 逝去]

// SegPos mode
SegPos pos = new SegPos("models/seg_pos.bin");
System.out.println(pos.segment(sentence));
//[滔滔/a, 的/u, 流水/n, ，/w, 向著/p, 波士頓灣/ns, 無聲/v, 逝去/v]

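4. NLPIR

NLPIR from the Institute of Computing Technology, Chinese Academy of Sciences: http://ictclas.nlpir.org/nlpir/ (analyzes Chinese text online)

Download: https://github.com/NLPIR-team/NLPIR

A short Java tutorial for the CAS segmentation system (NLPIR): http://www.cnblogs.com/wukongjiuwo/p/4092480.html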

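5. ansj analyzer

https://github.com/NLPchina/ansj_seg

A Java implementation of Chinese word segmentation based on n-Gram + CRF + HMM.

Segmentation speed is about 2 million characters per second (measured on a MacBook Air), with accuracy above 96%.

It currently implements Chinese word segmentation and Chinese name recognition.

It also offers user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging, which can be applied to natural language processing and suit projects that demand high segmentation quality.

Maven dependency:

<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.1</version>
</dependency>

Demo: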

String str = "歡迎使用ansj_seg,(ansj中文分詞)在這裡如果你遇到什麼問題都可以聯絡我.我一定盡我所能.幫助大家.ansj_seg更快,更準,更自由!" ;
 System.out.println(ToAnalysis.parse(str));

 歡迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分詞/n,),在/p,這裡/r,如果/c,你/r,遇到/v,什麼/r,問題/n,都/d,可以/v,聯絡/v,我/r,./m,我/r,一定/d,盡我所能/l,./m,幫助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,準/a,,,更/d,自由/a,!

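6. LTP from Harbin Institute of Technology

https://github.com/HIT-SCIR/ltp

LTP defines an XML-based representation for language processing results and, on top of it, provides a rich and efficient bottom-up suite of Chinese language processing modules (six core technologies covering lexical, syntactic, and semantic analysis), together with a DLL-based (Dynamic Link Library) application programming interface and visualization tools. It can also be used as a web service.

For how to use LTP, see: http://ltp.readthedocs.io/zh_CN/latest/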

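7. Paoding (庖丁解牛)

Download: http://pan.baidu.com/s/1eQ88SZS

Usage takes the following steps:

1. Configure the dic files: edit paoding-dic-home.properties inside paoding-analysis.jar, uncomment the line "#paoding.dic.home=dic", and set it to the local path of your dic directory, e.g. /home/hadoop/work/paoding-analysis-2.0.4-beta/dic
2. Import the jars into the project: add paoding-analysis.jar, commons-logging.jar, lucene-analyzers-2.2.0.jar, and lucene-core-2.2.0.jar to the project. The Paoding Chinese segmenter can then be used in code, for example: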

Analyzer analyzer = new PaodingAnalyzer(); // create the analyzer
// text to segment
String text = "庖丁系統是個完全基於lucene的中文分詞系統，它就是重新建了一個analyzer，叫做PaodingAnalyzer，這個analyer的核心任務就是生成一個可以切詞TokenStream。";
TokenStream tokenStream = analyzer.tokenStream(text, new StringReader(text)); // token stream for the text
try {
    Token t;
    while ((t = tokenStream.next()) != null) {
        System.out.println(t); // print each token
    }
} catch (IOException e) {
    e.printStackTrace();
}

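8. Sogou online segmentation

Sogou's online segmenter uses a character-tagging approach, mainly based on a linear-chain CRF model. Its part-of-speech tagging module is mainly based on a structured linear model.

Online service: http://www.sogou.com/labs/webservice/

9. word segmentation

Address: https://github.com/ysc/word

word is a distributed Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and can identify out-of-vocabulary words such as person, place, and organization names. Its behavior can be changed through a custom configuration file; it supports user-defined dictionaries, automatic detection of dictionary changes, and large-scale distributed environments; the segmentation algorithm can be chosen flexibly; the refine feature gives fine control over the segmentation result; and word-frequency statistics, part-of-speech tagging, synonym tagging, antonym tagging, and pinyin tagging are available. It provides 10 segmentation algorithms and 10 text-similarity algorithms, and integrates seamlessly with Lucene, Solr, Elasticsearch, and Luke. Note: word 1.3 requires JDK 1.8.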

Add the Maven dependency:

<dependencies>
    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>word</artifactId>
        <version>1.3</version>
    </dependency>
</dependencies>

Elasticsearch plugin:

1. Open a command line and change to the Elasticsearch bin directory
cd elasticsearch-2.1.1/bin

2. Run the plugin script to install the word segmentation plugin:
./plugin install http://apdplat.org/word/archive/v1.4.zip

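Note during installation:
    If you see:
        ERROR: failed to download
    or
        Failed to install word, reason: failed to download
    or
        ERROR: incorrect hash (SHA1)
    run the command again; if it still fails, retry a couple more times.

For Elasticsearch 1.x versions, use the following command instead: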
./plugin -u http://apdplat.org/word/archive/v1.3.1.zip -i word

3. Edit elasticsearch-2.1.1/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.type : "word"
index.analysis.tokenizer.default.type : "word"

4. Start Elasticsearch and test the result by opening the following URL in Chrome:
http://localhost:9200/_analyze?analyzer=word&text=楊尚川是APDPlat應用級產品開發平臺的作者

5. Custom configuration
Edit the configuration file elasticsearch-2.1.1/plugins/word/word.local.conf

6. Specify the segmentation algorithm
Edit elasticsearch-2.1.1/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"

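Valid values for segAlgorithm:
Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
Full segmentation: FullSegmentation
Minimal word count: MinimalWordCount
Maximum n-gram score: MaxNgramScore
If none is specified, bidirectional maximum matching (BidirectionalMaximumMatching) is used by default.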
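
For readers unfamiliar with dictionary-based matching, here is a minimal sketch of the simplest variant in the list, forward maximum matching. It is a textbook version written for illustration and is not taken from the word project; the class name, the seven-word dictionary, and the maxWordLength parameter are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: greedy left-to-right longest-match segmentation.
class MaximumMatchingSketch {

    static List<String> segment(String text, Set<String> dict, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Start with the longest candidate and shrink it until it is
            // a dictionary word or only one character is left.
            int end = Math.min(text.length(), start + maxWordLength);
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Tiny hypothetical dictionary.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "聯想", "全球", "最大", "筆記本", "筆記", "本廠", "廠商"));
        // prints [聯想, 是, 全球, 最大, 的, 筆記本, 廠商]
        System.out.println(segment("聯想是全球最大的筆記本廠商", dict, 3));
    }
}
```

Because the longest candidate wins at each position, 筆記本 is chosen over 筆記, and 本廠 is never produced. The reverse variant scans from the end of the text instead, and the bidirectional variants run both directions and pick one result by a heuristic.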

10. jcseg analyzer

https://code.google.com/archive/p/jcseg/

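11. Stanford segmenter

An open-source segmenter from Stanford University that now supports Chinese.

First, download Stanford Word Segmenter version 3.5.2 from [1], take the data folder from it, and place it under src/main/resources in the Maven project.

Then add the Maven dependencies: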

<properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <corenlp.version>3.6.0</corenlp.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
            <classifier>models</classifier>
        </dependency>
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>${corenlp.version}</version>
            <classifier>models-chinese</classifier>
        </dependency>
    </dependencies>

Test:

import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;

public class CoreNLPSegment {

    private static CoreNLPSegment instance;
    private CRFClassifier         classifier;

    private CoreNLPSegment(){
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        classifier = new CRFClassifier(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        classifier.flags.setProperties(props);
    }

    public static CoreNLPSegment getInstance() {
        if (instance == null) {
            instance = new CoreNLPSegment();
        }

        return instance;
    }

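    public String[] doSegment(String data) {
        // toArray() without an argument returns Object[], and casting that to
        // String[] throws ClassCastException; pass a typed array instead.
        return (String[]) classifier.segmentString(data).toArray(new String[0]);
    }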

    public static void main(String[] args) {

        String sentence = "他和我在學校裡常打桌球。";
        String ret[] = CoreNLPSegment.getInstance().doSegment(sentence);
        for (String str : ret) {
            System.out.println(str);
        }

    }

}

Blog posts:

https://blog.sectong.com/blog/corenlp_segment.html

http://blog.csdn.net/lightty/article/details/51766602

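12. Smartcn

Smartcn is an open-source Chinese word segmentation system under the Apache 2.0 license, written in Java and derived from the ICTCLAS segmenter of the Institute of Computing Technology, Chinese Academy of Sciences. Some time ago I noticed a new Chinese segmentation contribution in Lucene; at the time I only skimmed the names of its .class files, which were enough to tell that it was yet another adaptation of ICTCLAS.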

http://lucene.apache.org/core/5_1_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html

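13. pinyin analyzer

The pinyin analyzer lets users type pinyin and still find the matching keywords. For example, typing yonghui in a store's search box matches 永輝, which makes for a much better experience.

The pinyin analyzer is installed the same way as IK. Download: https://github.com/medcl/elasticsearch-analysis-pinyin

For the available parameters, see the README on GitHub.

As of version 1.8, this analyzer provides two rules:

pinyin: converts Chinese characters to their full pinyin;
pinyin_first_letter: extracts the first letter of each character's pinyin.

Usage:

1. Create an index with a custom pinyin analyzer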
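
The two rules can be illustrated with a toy lookup table. This sketch is purely hypothetical and is not how the plugin is implemented: the class PinyinSketch and its three-entry table are made up, whereas the real plugin relies on a full pinyin dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: full pinyin versus first-letter abbreviation.
class PinyinSketch {

    // Hypothetical three-entry table; a real converter needs a complete dictionary.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('劉', "liu");
        PINYIN.put('德', "de");
        PINYIN.put('華', "hua");
    }

    // "pinyin" rule: one full-pinyin token per known character.
    static List<String> fullPinyin(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            String pinyin = PINYIN.get(c);
            if (pinyin != null) {
                tokens.add(pinyin);
            }
        }
        return tokens;
    }

    // "pinyin_first_letter" rule: join the first letter of each pinyin.
    static String firstLetters(String text) {
        StringBuilder letters = new StringBuilder();
        for (String pinyin : fullPinyin(text)) {
            letters.append(pinyin.charAt(0));
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        System.out.println(fullPinyin("劉德華"));   // prints [liu, de, hua]
        System.out.println(firstLetters("劉德華")); // prints ldh
    }
}
```

Characters missing from the table are simply skipped here; how the plugin treats such characters depends on its own settings.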

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                    }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'

2. Test the analyzer by analyzing a Chinese name, such as 劉德華

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "劉德華",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}

3.Create mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'

4.Indexing

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"劉德華"}'

5.Let's search

curl http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua

6.Using Pinyin-TokenFilter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'

Token test: 劉德華 張學友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer

{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}

7.Used in phrase query

(1)、

 PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                      }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter":false,
                      "keep_separate_first_letter" : false,
                      "keep_full_pinyin" : true,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }
  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉德華"
    }}
  }

(2)、

PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                      }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter":false,
                      "keep_separate_first_letter" : true,
                      "keep_full_pinyin" : false,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  POST /medcl/folks/andy
  {"name":"劉德華"}

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉德h"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "劉dh"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "dh"
    }}
  }

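14. Mmseg analyzer

It also supports Elasticsearch.

Download: https://github.com/medcl/elasticsearch-analysis-mmseg/releases (pick the release that matches your Elasticsearch version)

How to use:

1. Create the index:

curl -XPUT http://localhost:9200/index

2. Create the mapping: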

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
        "properties": {
            "content": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "mmseg_maxword",
                "search_analyzer": "mmseg_maxword"
            }
        }

}'

3. Index some documents

curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部：各地校車將享最高路權"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韓漁警衝突調查：韓警平均每天扣1艘中國漁船"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'

4. Query with highlighting

curl -XPOST http://localhost:9200/index/fulltext/_search  -d'
{
    "query" : { "term" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

5. Result:

{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韓漁警衝突調查：韓警平均每天扣1艘中國漁船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中國</tag1>漁船 "
                    ]
                }
            }
        ]
    }
}

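Reference blog post:

Adding Chinese word segmentation to Elasticsearch: http://blog.csdn.net/dingzfang/article/details/42776693

15. bosonnlp (BosonNLP Chinese analyzer)

Download: https://github.com/bosondata/elasticsearch-analysis-bosonnlp

How to use:

Before starting Elasticsearch, edit elasticsearch.yml in the config folder to define the BosonNLP Chinese analyzer and fill in your BosonNLP API_TOKEN and the URL of the BosonNLP segmentation API. That is, append the following to the end of the file: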

index:
  analysis:
    analyzer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

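Notes:

API_URL must be set to the given segmentation URL, and API_TOKEN (the *PUT YOUR API TOKEN HERE* placeholder) must be set to your BosonNLP API token; otherwise the BosonNLP Chinese analyzer cannot be used. The API_TOKEN is issued when you register a BosonNLP account.

If the configuration file already defines other analyzers, add the bosonnlp analyzer under the existing analyzer key as shown above.

If there are multiple nodes and all of them need the BosonNLP plugin, install the plugin and apply the settings above to the yaml file on every node.

In addition, BosonNLP segmentation offers four parameters (space_mode, oov_level, t2s, special_char_conv) to cover different needs. To use the default values, change nothing; otherwise, uncomment the corresponding parameter and assign a value.

Test:

Create an index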

curl -XPUT 'localhost:9200/test'

Check that the analyzer is configured correctly

curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '這是玻森資料分詞的測試'

Result

{
  "tokens" : [ {
    "token" : "這",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "是",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "玻森",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "資料",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "分詞",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "的",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "測試",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "word",
    "position" : 6
  } ]
}
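One way to read the start_offset and end_offset fields: each token should equal the slice of the input text between its two offsets. The Python sketch below checks this for the result above. The function name covers_text_exactly is hypothetical. It also assumes the tokens are contiguous, which holds for this sentence but not for input containing punctuation or whitespace that the tokenizer drops. Python indexes by code point, which matches these offsets because every character here is in the Basic Multilingual Plane.

```python
text = "這是玻森資料分詞的測試"

# (token, start_offset, end_offset) triples copied from the result above.
tokens = [
    ("這", 0, 1),
    ("是", 1, 2),
    ("玻森", 2, 4),
    ("資料", 4, 6),
    ("分詞", 6, 8),
    ("的", 8, 9),
    ("測試", 9, 11),
]


def covers_text_exactly(text, tokens):
    """True if the tokens are contiguous slices that together rebuild the text."""
    cursor = 0
    for token, start, end in tokens:
        if start != cursor or text[start:end] != token:
            return False
        cursor = end
    return cursor == len(text)


print(covers_text_exactly(text, tokens))
```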

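Configuring a token filter

The current BosonNLP analyzer has no built-in token filter. If you need to filter tokens, you can build a custom analyzer from the BosonNLP tokenizer and the token filters that ES provides.

Steps

Configuring a custom analyzer takes the following three steps:

Add the BosonNLP tokenizer

In elasticsearch.yml, add a tokenizer key under analysis and put the BosonNLP tokenizer configuration inside it: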

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

Add a token filter

In elasticsearch.yml, add a filter key under analysis and put the configuration of the filters you need inside it (the example below uses the lowercase filter):

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      ...
    filter:
      lowercase:
          type: lowercase

Add the custom analyzer

In elasticsearch.yml, add an analyzer key under analysis and put the custom analyzer's configuration inside it (in the example below the custom analyzer is named filter_bosonnlp):

index:
  analysis:
    analyzer:
      ...
      filter_bosonnlp:
          type: custom
          tokenizer: bosonnlp
          filter: [lowercase]
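The three fragments above all live under the same index.analysis key, so together they form one configuration. The Python sketch below merges them as nested dictionaries to show the combined shape. The deep_merge helper is hypothetical, and the token value is the same placeholder used in the yaml.

```python
def deep_merge(base, extra):
    """Recursively merge two nested dicts without mutating either one."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Step 1: the tokenizer definition.
tokenizer_step = {"index": {"analysis": {"tokenizer": {"bosonnlp": {
    "type": "bosonnlp",
    "API_URL": "http://api.bosonnlp.com/tag/analysis",
    "API_TOKEN": "PUT YOUR API TOKEN HERE",  # placeholder, as in the yaml
}}}}}

# Step 2: the token filter definition.
filter_step = {"index": {"analysis": {"filter": {
    "lowercase": {"type": "lowercase"},
}}}}

# Step 3: the custom analyzer that refers to both by name.
analyzer_step = {"index": {"analysis": {"analyzer": {"filter_bosonnlp": {
    "type": "custom",
    "tokenizer": "bosonnlp",
    "filter": ["lowercase"],
}}}}}

settings = deep_merge(deep_merge(tokenizer_step, filter_step), analyzer_step)
analysis = settings["index"]["analysis"]
print(sorted(analysis))
```

The point of the merge is that the analyzer refers to the tokenizer and the filter by name, so all three definitions have to sit in the same analysis block.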

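Custom analyzers

Elasticsearch ships with a number of ready-made analyzers, but its real strength in this area is that you can create custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.

Character filters:

Character filters tidy up a string before it is tokenized. For example, if our text is in HTML, it will contain tags such as <p> or <div> that we do not want to index. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities, for example turning &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters.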
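As a rough illustration of what such a character filter does, here is a minimal Python sketch that drops tags and decodes entities. It is a simplified stand-in written with the standard library, not Elasticsearch's html_strip implementation, and the function name strip_html is hypothetical.

```python
import html
import re

# Anything between "<" and ">" is treated as a tag; a deliberate simplification.
TAG = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tags first, then decode entities such as &Aacute; or &amp;."""
    return html.unescape(TAG.sub("", text))


print(strip_html("<p>&Aacute;gua &amp; <b>sal</b></p>"))  # prints: Água & sal
```

Tags are removed before entities are decoded so that an escaped tag such as &lt;p&gt; survives as literal text instead of being stripped.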

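Tokenizers:

An analyzer must have exactly one tokenizer. The tokenizer splits the string into individual terms or tokens. The standard tokenizer, which the standard analyzer uses, splits a string into terms on word boundaries and removes most punctuation, but other tokenizers with different behaviour exist.

Token filters:

After tokenization, the resulting token stream passes through the specified token filters in the order given.

Token filters can modify, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many more. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, turning a word such as "très" into "tres". The ngram and edge_ngram token filters produce tokens suited to partial matching or autocomplete.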
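The behaviour of a few of these filters can be sketched in plain Python. These are simplified stand-ins, not the Elasticsearch implementations, and every function name below is hypothetical. In particular, real ASCII folding handles more characters than the accent removal shown here.

```python
import unicodedata


def lowercase(tokens):
    """Modify tokens: lowercase each one."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stopwords):
    """Remove tokens: drop anything in the stop word list."""
    return [token for token in tokens if token not in stopwords]


def fold_ascii(tokens):
    """Strip combining accents after Unicode decomposition."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def edge_ngrams(token, min_gram, max_gram):
    """Add tokens: emit the prefixes of a token, useful for autocomplete."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Filters run in the order given, each one consuming the previous output.
tokens = remove_stopwords(lowercase(["The", "Très", "Quick"]), {"the", "a"})
print(fold_ascii(tokens))          # ['tres', 'quick']
print(edge_ngrams("quick", 1, 3))  # ['q', 'qu', 'qui']
```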

Creating a custom analyzer

We can configure character filters, tokenizers, and token filters in their respective sections under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

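The analyzer built in this example does the following:

1. Strips the HTML using the html_strip character filter.

2. Replaces & with " and " using a custom mapping character filter: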

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

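3. Tokenizes with the standard tokenizer.

4. Lowercases terms with the lowercase token filter.

5. Removes the words in a custom stop word list with a custom stop token filter: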

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

The analyzer definition combines the predefined tokenizer and filters with the custom filters configured above:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

Put together, the complete create-index request looks like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}

Once the index has been created, use the analyze API to test the new analyzer:

GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox

The abbreviated result below shows that the analyzer is working correctly:

{
  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}
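To see why those four tokens come out with positions 2 to 5, the whole chain can be imitated in a few lines of Python. This is a sketch under stated assumptions, not how Elasticsearch is implemented: runs of word characters stand in for the standard tokenizer, tag removal and entity decoding stand in for html_strip, and positions are counted from 1 to match the listing above. The Python function my_analyzer is hypothetical and only borrows the analyzer's name.

```python
import html
import re

STOPWORDS = {"the", "a"}  # the my_stopwords list from the request above


def my_analyzer(text):
    # Character filters: strip HTML, then map "&" to " and ".
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    text = text.replace("&", " and ")
    # Tokenizer: runs of word characters (stand-in for the standard tokenizer).
    words = re.findall(r"\w+", text)
    # Token filters: lowercase, then drop stop words while keeping positions.
    tokens = []
    for position, word in enumerate(words, start=1):
        term = word.lower()
        if term not in STOPWORDS:
            tokens.append({"token": term, "position": position})
    return tokens


for entry in my_analyzer("The quick & brown fox"):
    print(entry)
```

"The" takes position 1 and is then removed by the stop filter, which is why the first surviving token, "quick", sits at position 2.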

This analyzer is of little use until we tell Elasticsearch where to apply it. We can apply it to a string field like this:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}

Finally

This article was compiled from material found online. If anything here is incorrect, corrections are very welcome!