Elasticsearch 6.x IK Chinese Analyzer Integration
Posted by 阿新 on 2018-12-17
Elasticsearch is an open-source, real-time distributed search and analytics engine built on Apache Lucene(TM). It is used for full-text search, structured search, analytics, and any combination of the three. The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries.
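The custom-dictionary support mentioned above is configured in the plugin's `IKAnalyzer.cfg.xml` (under the plugin's `config` directory). A sketch of that file; the `.dic` file names here are illustrative placeholders, and the remote entries are optional:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary (one word per line, UTF-8) -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- local extension stopword dictionary -->
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>
```

Local dictionaries are loaded when the node starts, so a restart is needed after editing them.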
1. Choosing the IK version
The IK plugin version is determined by the Elasticsearch version, as shown in the table below.
IK version | ES version |
master | 6.x -> master |
6.3.0 | 6.3.0 |
6.2.4 | 6.2.4 |
6.1.3 | 6.1.3 |
5.6.8 | 5.6.8 |
5.5.3 | 5.5.3 |
5.4.3 | 5.4.3 |
5.3.3 | 5.3.3 |
5.2.2 | 5.2.2 |
5.1.2 | 5.1.2 |
1.10.6 | 2.4.6 |
1.9.5 | 2.3.5 |
1.8.1 | 2.2.1 |
1.7.0 | 2.1.1 |
1.5.0 | 2.0.0 |
1.2.6 | 1.0.0 |
1.2.5 | 0.90.x |
1.1.3 | 0.20.x |
1.0.0 | 0.16.2 -> 0.19.0 |
2. Online installation
Install the release that matches your Elasticsearch version directly from GitHub:
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip
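After the install completes, you can confirm the plugin was registered (run from the same Elasticsearch home directory; requires an Elasticsearch installation):

```shell
# list installed plugins; "analysis-ik" should appear in the output
bin/elasticsearch-plugin list
```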
3. Restarting Elasticsearch
The plugin only takes effect after a restart:
ps -ef | grep elasticsearch                             # find the Elasticsearch process ID
kill -9 <pid>                                           # kill the Elasticsearch process
bin/elasticsearch -d && tail -f logs/elasticsearch.log  # restart in the background and tail the log
4. Testing IK
IK supports two analysis modes, ik_smart and ik_max_word. The difference:
ik_max_word: splits the text at the finest granularity. For example, "內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密" is split into "內地、港澳同胞、港澳、同胞、港、珠、澳、大橋、讓、港澳、與國、國家、融合、更緊、緊密", exhausting every possible word combination;
ik_smart: splits at the coarsest granularity. For example, the same sentence is split into "內地、港澳同胞、港、珠、澳、大橋、讓、港澳、與、國家、融合、更、緊密".
4.1 ik_max_word analysis
Request JSON:
curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
"analyzer": "ik_max_word",
"text": "內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密"
}'
Tokenization result:
{
"tokens" : [
{
"token" : "內地",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "港澳同胞",
"start_offset" : 2,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "港澳",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "同胞",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "港",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "珠",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "澳",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "大橋",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "讓",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "港澳",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "與國",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "國家",
"start_offset" : 16,
"end_offset" : 18,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "融合",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "更緊",
"start_offset" : 20,
"end_offset" : 22,
"type" : "CN_WORD",
"position" : 13
},
{
"token" : "緊密",
"start_offset" : 21,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 14
}
]
}
4.2 ik_smart analysis
Request JSON:
curl -XGET 'http://lee:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
"analyzer": "ik_smart",
"text": "內地港澳同胞:港珠澳大橋讓港澳與國家融合更緊密"
}'
Tokenization result:
{
"tokens" : [
{
"token" : "內地",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "港澳同胞",
"start_offset" : 2,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "港",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "珠",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "澳",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "大橋",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "讓",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "港澳",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "與",
"start_offset" : 15,
"end_offset" : 16,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "國家",
"start_offset" : 16,
"end_offset" : 18,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "融合",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "更",
"start_offset" : 20,
"end_offset" : 21,
"type" : "CN_CHAR",
"position" : 11
},
{
"token" : "緊密",
"start_offset" : 21,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 12
}
]
}
4.3 Search with IK analysis
4.3.1 Create the index
curl -XPUT http://lee:9200/index
4.3.2 Define the mapping
curl -XPOST http://lee:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d '
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}'
4.3.3 Index documents
curl -XPOST http://lee:9200/index/fulltext/1 -H 'Content-Type:application/json' -d '
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
curl -XPOST http://lee:9200/index/fulltext/2 -H 'Content-Type:application/json' -d '
{"content":"公安部:各地校車將享最高路權"}
'
curl -XPOST http://lee:9200/index/fulltext/3 -H 'Content-Type:application/json' -d '
{"content":"中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"}
'
curl -XPOST http://lee:9200/index/fulltext/4 -H 'Content-Type:application/json' -d '
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'
4.3.4 Query
curl -XPOST http://lee:9200/index/fulltext/_search -H 'Content-Type:application/json' -d '
{
"query" : { "match" : { "content" : "中國" }},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"content" : {}
}
}
}
'
4.3.5 Query result
{
"took": 136,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6489038,
"hits": [{
"_index": "index",
"_type": "fulltext",
"_id": "4",
"_score": 0.6489038,
"_source": {
"content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
},
"highlight": {
"content": ["<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"]
}
}, {
"_index": "index",
"_type": "fulltext",
"_id": "3",
"_score": 0.2876821,
"_source": {
"content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
},
"highlight": {
"content": ["中韓漁警衝突調查:韓警平均每天扣1艘<tag1>中國</tag1>漁船"]
}
}]
}
}