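Installing and Configuring IK + pinyin Analysis in Elasticsearch
1. Where pinyin analysis is used
Pinyin analysis is common in everyday products, and you may well use it every day. Open Taobao and type the pinyin "zhonghua": the suggestion list shows products containing "中華", the Chinese word that "zhonghua" spells.
Pinyin analysis suggests the matching Chinese text for pinyin input, which improves the search experience and makes searching faster for the user. The rest of this post shows how to configure and use pinyin + IK analysis in Elasticsearch 5.1.1.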
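2. Downloading and installing the IK analyzer
The IK analyzer needs little introduction: it is a very widely used Chinese analyzer with good segmentation quality, and most ES developers working with Chinese text use it.
Download: https://github.com/medcl/elasticsearch-analysis-ik
After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you do not have it):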
mvn package
A successful build creates a target folder. In elasticsearch-analysis-ik-master/target/releases, find elasticsearch-analysis-ik-5.1.1.zip; this is the installation file we need. Unzipping it yields the following:
commons-codec-1.9.jar
commons-logging-1.2.jar
config
elasticsearch-analysis-ik-5.1.1.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
plugin-descriptor.properties
Then create a folder named ik under elasticsearch-5.1.1/plugins and copy the files unzipped from elasticsearch-analysis-ik-5.1.1.zip into elasticsearch-5.1.1/plugins/ik. A screenshot helps illustrate the layout.
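The manual copy step can also be scripted. The sketch below assumes the release zip keeps its files at the top level, as in the file listing above. `install_plugin` and the paths in the usage comment are illustrative names, not part of Elasticsearch.

```python
import zipfile
from pathlib import Path


def install_plugin(release_zip, es_home, plugin_name):
    """Extract a plugin release zip into <es_home>/plugins/<plugin_name>.

    Returns the sorted names of the entries now present in the plugin folder.
    """
    target = Path(es_home) / "plugins" / plugin_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(release_zip) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())


# Example call (paths are assumptions; adjust to your own layout):
# install_plugin(
#     "elasticsearch-analysis-ik-master/target/releases/elasticsearch-analysis-ik-5.1.1.zip",
#     "elasticsearch-5.1.1",
#     "ik",
# )
```

The same call with the pinyin release zip and a different folder name would cover the pinyin plugin, provided its zip has the same flat layout.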
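3. Downloading and installing the pinyin analyzer
The process is the same as for IK: download, build, and add to ES. The steps are not repeated here; only a screenshot of the final layout is given.
4. Testing the analyzers
Once IK and pinyin are in place, restart ES. If ES reports an error during the restart, something is wrong with the installation; if it starts cleanly, the plugins are set up correctly.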
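A clean startup log is an indirect signal. One way to confirm directly is to ask the node which plugins it loaded, through the `_nodes/plugins` endpoint. The sketch below rests on assumptions: `loaded_plugins` and `fetch_nodes_info` are made-up helper names, and the plugin names `analysis-ik` and `analysis-pinyin` in the usage comment are what the two projects are believed to declare, so check them against your own plugin-descriptor.properties.

```python
import json
import urllib.request


def loaded_plugins(nodes_info):
    """Map each node id to the set of plugin names that node reports."""
    return {
        node_id: {plugin["name"] for plugin in node.get("plugins", [])}
        for node_id, node in nodes_info.get("nodes", {}).items()
    }


def fetch_nodes_info(base_url="http://localhost:9200"):
    """Call GET /_nodes/plugins on a running node (needs a reachable ES)."""
    with urllib.request.urlopen(base_url + "/_nodes/plugins") as resp:
        return json.load(resp)


# Against a running node (plugin names below are assumptions):
# for node_id, names in loaded_plugins(fetch_nodes_info()).items():
#     print(node_id, {"analysis-ik", "analysis-pinyin"} <= names)
```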
4.1 IK analyzer test
Create an index:
curl -XPUT "http://localhost:9200/index"
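Test the analyzer (the input text is "中華人民共和國國歌", which is what the offsets in the result below correspond to):
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中華人民共和國國歌"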
Result:
{
"tokens": [{
"token": "中華人民共和國",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
}, {
"token": "中華人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
}, {
"token": "中華",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
}, {
"token": "華人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
}, {
"token": "人民共和國",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
}, {
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
}, {
"token": "共和國",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
}, {
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
}, {
"token": "國",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
}, {
"token": "國歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
}]
}
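The query-string form above puts the Chinese text directly in the URL, so it depends on how the shell and the server handle those raw bytes. The `_analyze` API also accepts a JSON body with `analyzer` and `text` fields, which sidesteps the issue. A minimal sketch using only the standard library follows; `build_analyze_request` and `token_texts` are hypothetical helper names, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_analyze_request(index, analyzer, text, base_url="http://localhost:9200"):
    """Build a POST /<index>/_analyze request carrying a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(analyze_response):
    """Return the token strings from an _analyze response, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


# To send it against a running node:
# req = build_analyze_request("index", "ik_max_word", "中華人民共和國國歌")
# with urllib.request.urlopen(req) as resp:
#     print(token_texts(json.load(resp)))
```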
Using ik_smart on the same text:
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中華人民共和國國歌"
Result:
{
"tokens": [{
"token": "中華人民共和國",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
}, {
"token": "國歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}]
}
Screenshot for reference:
4.2 Pinyin analyzer test
Test the pinyin analyzer:
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=張學友"
Result:
{
"tokens": [{
"token": "zhang",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
}, {
"token": "xue",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
}, {
"token": "you",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
}, {
"token": "zxy",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 3
}]
}
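In the output above, the last token `zxy` is made of the first letters of the three syllable tokens. The snippet below only illustrates that relationship. It is not the plugin's implementation, and `initials` is a made-up helper.

```python
def initials(syllables):
    """Join the first letter of each pinyin syllable into one token."""
    return "".join(s[0] for s in syllables if s)


print(initials(["zhang", "xue", "you"]))  # zxy
```

This kind of token is what lets abbreviated input such as `ldh` find "劉德華" later in the post.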
5. Combining IK and pinyin
5.1 Creating the index and configuring the analyzer
Create an index and define its analysis settings:
curl -XPUT "http://localhost:9200/medcl/" -d'
{
"index": {
"analysis": {
"analyzer": {
"ik_pinyin_analyzer": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": ["my_pinyin", "word_delimiter"]
}
},
"filter": {
"my_pinyin": {
"type": "pinyin",
"first_letter": "prefix",
"padding_char": " "
}
}
}
}
}'
Create a type and set its mapping:
curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
"folks": {
"properties": {
"name": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"store": "no",
"term_vector": "with_positions_offsets",
"analyzer": "ik_pinyin_analyzer",
"boost": 10
}
}
}
}
}
}'
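Elasticsearch also allows settings and mappings to be supplied together in the body of the index-creation request. The sketch below assembles such a body from the two requests above. Doing it in one call is a variation shown for illustration, not the procedure this post follows, and `create_index_body` is a hypothetical helper. The `store` option is left at its default (not stored) instead of being spelled out.

```python
import json


def create_index_body():
    """Combine the analysis settings and the folks mapping into one body."""
    analysis = {
        "analyzer": {
            "ik_pinyin_analyzer": {
                "type": "custom",
                "tokenizer": "ik_smart",
                "filter": ["my_pinyin", "word_delimiter"],
            }
        },
        "filter": {
            "my_pinyin": {"type": "pinyin", "first_letter": "prefix", "padding_char": " "}
        },
    }
    name_field = {
        "type": "keyword",  # exact, unanalyzed value
        "fields": {
            "pinyin": {  # analyzed sub-field, queried as name.pinyin
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_pinyin_analyzer",
                "boost": 10,
            }
        },
    }
    return {
        "settings": {"index": {"analysis": analysis}},
        "mappings": {"folks": {"properties": {"name": name_field}}},
    }


# Body for: PUT http://localhost:9200/medcl
print(json.dumps(create_index_body(), ensure_ascii=False, indent=2))
```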
5.2 Indexing test documents
Index two test documents. Document 1:
curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"劉德華"}'
Document 2:
curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中華人民共和國國歌"}'
5.3 Test (1): pinyin matching
All four commands below match "劉德華":
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
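The four URI searches differ only in the term after `name.pinyin:`. A small sketch that generates them is shown below. `pinyin_search_url` is an illustrative name, and the URLs are only built, not requested.

```python
from urllib.parse import urlencode


def pinyin_search_url(term, base_url="http://localhost:9200", index="medcl", doc_type="folks"):
    """Build a URI search against the name.pinyin sub-field."""
    query = urlencode({"q": f"name.pinyin:{term}"})
    return f"{base_url}/{index}/{doc_type}/_search?{query}"


for term in ["liu", "de", "hua", "ldh"]:
    print(pinyin_search_url(term))
```

`urlencode` percent-encodes the colon as `%3A`; the server decodes it back to `name.pinyin:liu`, so the query is the same as in the curl commands.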
5.4 Test (2): IK segmentation
curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
"query": {
"match": {
"name.pinyin": "國歌"
}
},
"highlight": {
"fields": {
"name.pinyin": {}
}
}
}'
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 16.698704,
"hits" : [
{
"_index" : "medcl",
"_type" : "folks",
"_id" : "tina",
"_score" : 16.698704,
"_source" : {
"name" : "中華人民共和國國歌"
},
"highlight" : {
"name.pinyin" : [
"<em>中華人民共和國</em><em>國歌</em>"
]
}
}
]
}
}
This shows that the IK tokenizer is taking effect.
5.5 Test (3): pinyin + IK together
curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
"query": {
"match": {
"name.pinyin": "zhonghua"
}
},
"highlight": {
"fields": {
"name.pinyin": {}
}
}
}'
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 5.9814634,
"hits" : [
{
"_index" : "medcl",
"_type" : "folks",
"_id" : "tina",
"_score" : 5.9814634,
"_source" : {
"name" : "中華人民共和國國歌"
},
"highlight" : {
"name.pinyin" : [
"<em>中華人民共和國</em>國歌"
]
}
},
{
"_index" : "medcl",
"_type" : "folks",
"_id" : "andy",
"_score" : 2.2534127,
"_source" : {
"name" : "劉德華"
},
"highlight" : {
"name.pinyin" : [
"<em>劉德華</em>"
]
}
}
]
}
}
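When consuming these responses programmatically, the fields of interest are usually the document id, the score, and the highlighted fragments. The sketch below uses `summarize_hits` as a hypothetical helper and a trimmed copy of the response above as input.

```python
def summarize_hits(response, field="name.pinyin"):
    """Return (id, score, highlight fragments) for each hit, in ranked order."""
    return [
        (hit["_id"], hit["_score"], hit.get("highlight", {}).get(field, []))
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the search response shown above.
sample = {
    "hits": {
        "hits": [
            {"_id": "tina", "_score": 5.9814634,
             "highlight": {"name.pinyin": ["<em>中華人民共和國</em>國歌"]}},
            {"_id": "andy", "_score": 2.2534127,
             "highlight": {"name.pinyin": ["<em>劉德華</em>"]}},
        ]
    }
}
for doc_id, score, fragments in summarize_hits(sample):
    print(doc_id, score, fragments)
```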
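Screenshot:
With this mapping, pinyin searches must target the name.pinyin sub-field. The name field itself is mapped as keyword and is not analyzed, so the same search against the original field returns no results.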
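The difference comes from how the two fields are indexed. Below is a hedged sketch of the two query bodies: a `term` query on `name` compares against the exact keyword value, while a `match` query on `name.pinyin` is analyzed with `ik_pinyin_analyzer`. The helper names are made up for illustration, and the bodies are only printed, not sent.

```python
import json


def exact_name_query(value):
    """term query: compares against the unanalyzed keyword value of name."""
    return {"query": {"term": {"name": value}}}


def pinyin_name_query(text):
    """match query: the text is analyzed with the analyzer of name.pinyin."""
    return {"query": {"match": {"name.pinyin": text}}}


# Bodies for: POST http://localhost:9200/medcl/_search
print(json.dumps(exact_name_query("劉德華"), ensure_ascii=False))
print(json.dumps(pinyin_name_query("ldh"), ensure_ascii=False))
```

Under the mapping in section 5.1, only the full value "劉德華" would be expected to match on `name`, whereas pinyin input needs the `name.pinyin` form.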