Notes

Install the ik and pinyin analysis plugins for Elasticsearch. Each plugin's version must match the Elasticsearch version exactly.

ik analysis plugin: https://github.com/medcl/elasticsearch-analysis-ik/

pinyin analysis plugin: https://github.com/medcl/elasticsearch-analysis-pinyin/

This article uses Elasticsearch 5.6.9.
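Because both plugins have to match the Elasticsearch version, it can help to derive every download URL from a single version variable. The sketch below is only an illustration: the helper name plugin_url and the variable ES_VERSION are made up here, and the URL pattern is simply the one used by the two release links in this article, so other releases may not follow it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the release URL of a medcl analysis plugin.
# The pattern mirrors the 5.6.9 download links used in this article.
plugin_url() {
  local name="$1" version="$2"
  printf 'https://github.com/medcl/elasticsearch-analysis-%s/releases/download/v%s/elasticsearch-analysis-%s-%s.zip\n' \
    "$name" "$version" "$name" "$version"
}

ES_VERSION="5.6.9"   # keep this equal to the tag of the elasticsearch image

# Print the URL of each plugin for the chosen version (nothing is downloaded).
for name in ik pinyin; do
  plugin_url "$name" "$ES_VERSION"
done
```

Each printed URL can then be passed to wget, for example wget "$(plugin_url ik "$ES_VERSION")".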

Getting started

Pull the image

docker pull elasticsearch:5.6.9

Download the plugin packages

mkdir docker && cd docker # create a working directory and enter it
# Download the ik plugin
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.9/elasticsearch-analysis-ik-5.6.9.zip

# Unzip it
unzip elasticsearch-analysis-ik-5.6.9.zip -d analysis-ik

# Download the pinyin plugin
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v5.6.9/elasticsearch-analysis-pinyin-5.6.9.zip

# Unzip it
unzip elasticsearch-analysis-pinyin-5.6.9.zip -d analysis-pinyin
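Before building the image, the unpacked plugins can be checked against the target version. Elasticsearch plugins ship a plugin-descriptor.properties file with an elasticsearch.version entry. The sketch below reads that entry. The function names and the ES_VERSION variable are hypothetical, and the descriptor is located with find because its exact position inside the unpacked folder is an assumption.

```shell
#!/usr/bin/env bash
# Print the Elasticsearch version a plugin directory was built for.
plugin_es_version() {
  local descriptor
  # The descriptor may sit at the top level or in a subdirectory.
  descriptor="$(find "$1" -name plugin-descriptor.properties | head -n 1)"
  [ -n "$descriptor" ] || return 1
  sed -n 's/^elasticsearch\.version=//p' "$descriptor" | tr -d '\r'
}

# Compare the descriptor version of a directory with the expected version.
check_plugin() {
  local dir="$1" expected="$2" actual
  actual="$(plugin_es_version "$dir")" || { echo "$dir: no descriptor found"; return 1; }
  if [ "$actual" = "$expected" ]; then
    echo "$dir: ok ($actual)"
  else
    echo "$dir: built for $actual, expected $expected"
    return 1
  fi
}

ES_VERSION="5.6.9"   # assumed target version, same as the image tag
for dir in analysis-ik analysis-pinyin; do
  # Skip directories that have not been unpacked yet.
  if [ -d "$dir" ]; then
    check_plugin "$dir" "$ES_VERSION" || echo "fix $dir before building"
  fi
done
```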


Create the Dockerfile

FROM elasticsearch:5.6.9
ADD analysis-ik /usr/share/elasticsearch/plugins/analysis-ik  
ADD analysis-pinyin /usr/share/elasticsearch/plugins/analysis-pinyin

Build the image

docker build -f Dockerfile -t elasticsearch-ik-pinyin:5.6.9 .

A successful build prints:

root@Alone88-Uos:~/docker/els6# docker build -f Dockerfile -t elasticsearch-ik-pinyin:5.6.9 .
Sending build context to Docker daemon  18.01MB
Step 1/3 : FROM elasticsearch:5.6.9
 ---> 5c1e1ecfe33a
Step 2/3 : ADD analysis-ik /usr/share/elasticsearch/plugins/analysis-ik
 ---> 883cd55df8a8
Step 3/3 : ADD analysis-pinyin /usr/share/elasticsearch/plugins/analysis-pinyin
 ---> 8c9220f304be
Successfully built 8c9220f304be
Successfully tagged elasticsearch-ik-pinyin:5.6.9

Create the container

docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch_test elasticsearch-ik-pinyin:5.6.9

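-e ES_JAVA_OPTS="-Xms256m -Xmx256m" sets the JVM heap size for Elasticsearch to 256 MB. Without it, the heap size comes from jvm.options, which ships with 2 GB in Elasticsearch 5.x and can be too much for a small test machine.

-e "discovery.type=single-node" runs Elasticsearch as a single node.

elasticsearch-ik-pinyin:5.6.9 is the name and tag of the image built above.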
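The container starts in the background, and Elasticsearch usually needs some time before it answers requests. A minimal sketch for waiting until it is up is shown below. The helper names retry and es_ready are hypothetical, and plain HTTP on 127.0.0.1:9200 is assumed because the container is started with -p 9200:9200 and no TLS configuration.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds: at most $1 attempts, $2 seconds apart.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Succeeds once the HTTP port of the test container answers.
es_ready() {
  curl -fsS "http://127.0.0.1:9200/" >/dev/null 2>&1
}

# Example: wait up to about a minute, then list the installed plugins.
#   retry 30 2 es_ready && curl -s 'http://127.0.0.1:9200/_cat/plugins?v'
```

If both plugins were picked up, the _cat/plugins output should contain one line for each of them.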

Test the analyzers

Test pinyin

Request URL: http://127.0.0.1:9200/_analyze
Method: POST
Request body:

{ "text": "中华人民共和国国徽", "analyzer": "pinyin" }

Response:

{
    "tokens":[
        {
            "token":"zhong",
            "start_offset":0,
            "end_offset":1,
            "type":"word",
            "position":0
        },
        {
            "token":"zhrmghggh",
            "start_offset":0,
            "end_offset":9,
            "type":"word",
            "position":0
        },
        {
            "token":"hua",
            "start_offset":1,
            "end_offset":2,
            "type":"word",
            "position":1
        },
        {
            "token":"ren",
            "start_offset":2,
            "end_offset":3,
            "type":"word",
            "position":2
        },
        {
            "token":"min",
            "start_offset":3,
            "end_offset":4,
            "type":"word",
            "position":3
        },
        {
            "token":"gong",
            "start_offset":4,
            "end_offset":5,
            "type":"word",
            "position":4
        },
        {
            "token":"he",
            "start_offset":5,
            "end_offset":6,
            "type":"word",
            "position":5
        },
        {
            "token":"guo",
            "start_offset":6,
            "end_offset":7,
            "type":"word",
            "position":6
        },
        {
            "token":"guo",
            "start_offset":7,
            "end_offset":8,
            "type":"word",
            "position":7
        },
        {
            "token":"hui",
            "start_offset":8,
            "end_offset":9,
            "type":"word",
            "position":8
        }
    ]
}
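The pinyin request above can also be sent from the command line. The following is a sketch rather than a fixed recipe: the function names and the ES_URL variable are hypothetical, the address assumes the container from this article on plain HTTP, and the body builder does no JSON escaping, so the text must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
ES_URL="${ES_URL:-http://127.0.0.1:9200}"   # assumed address of the test container

# Build the JSON body for _analyze from an analyzer name and a text.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# POST the body to the _analyze endpoint and pretty-print the response.
analyze() {
  curl -sS -X POST "$ES_URL/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Example:
#   analyze pinyin "中华人民共和国国徽"
```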

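Test ik analysis

The analyzer field accepts chinese, ik_max_word, or ik_smart. chinese is an analyzer built into Elasticsearch, while ik_max_word (finest-grained segmentation) and ik_smart (coarsest segmentation, fewest tokens) are provided by the ik plugin.

Request URL: http://127.0.0.1:9200/_analyze
Method: POST
Request body:

{
    "text":"中华人民共和国国徽",
    "analyzer":"ik_max_word"
}

Response: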

{
    "tokens":[
        {
            "token":"中华人民共和国",
            "start_offset":0,
            "end_offset":7,
            "type":"CN_WORD",
            "position":0
        },
        {
            "token":"中华人民",
            "start_offset":0,
            "end_offset":4,
            "type":"CN_WORD",
            "position":1
        },
        {
            "token":"中华",
            "start_offset":0,
            "end_offset":2,
            "type":"CN_WORD",
            "position":2
        },
        {
            "token":"华人",
            "start_offset":1,
            "end_offset":3,
            "type":"CN_WORD",
            "position":3
        },
        {
            "token":"人民共和国",
            "start_offset":2,
            "end_offset":7,
            "type":"CN_WORD",
            "position":4
        },
        {
            "token":"人民",
            "start_offset":2,
            "end_offset":4,
            "type":"CN_WORD",
            "position":5
        },
        {
            "token":"共和国",
            "start_offset":4,
            "end_offset":7,
            "type":"CN_WORD",
            "position":6
        },
        {
            "token":"共和",
            "start_offset":4,
            "end_offset":6,
            "type":"CN_WORD",
            "position":7
        },
        {
            "token":"国",
            "start_offset":6,
            "end_offset":7,
            "type":"CN_CHAR",
            "position":8
        },
        {
            "token":"国徽",
            "start_offset":7,
            "end_offset":9,
            "type":"CN_WORD",
            "position":9
        }
    ]
}
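To compare how ik_max_word and ik_smart split the same text, it helps to reduce a response to just its tokens. The helper below is a rough text-based sketch, not a JSON parser: the name tokens is hypothetical, and it assumes that token values contain no double quotes.

```shell
#!/usr/bin/env bash
# Read an _analyze response on stdin and print one token per line.
# Matches both compact ("token":"x") and pretty-printed ("token" : "x") output.
tokens() {
  grep -oE '"token" ?: ?"[^"]*"' | sed -E 's/^"token" ?: ?"(.*)"$/\1/'
}

# Example, assuming Elasticsearch listens on plain HTTP at 127.0.0.1:9200:
#   curl -sS -X POST 'http://127.0.0.1:9200/_analyze' \
#     -H 'Content-Type: application/json' \
#     -d '{"analyzer":"ik_smart","text":"中华人民共和国国徽"}' | tokens
```

Running the example once with ik_smart and once with ik_max_word gives two short lists that are easy to compare by eye.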

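Note: with either the pinyin analyzer or the IK analyzer, a document can only be found by a search if the field was processed by that analyzer when it was indexed. Data that was not analyzed this way will not match.

Combining IK and pinyin analysis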

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_smart_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        },
        "ik_max_word_pinyin": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}

When creating the type, set the analyzer property of each text field to one of the custom analyzers defined above.

PUT /my_index/my_type/_mapping
{
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart_pinyin"
        }
      }
    }
}
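With the index and mapping in place, the combined analyzer can be tried end to end by indexing one document and searching it by pinyin and by Chinese text. This is a sketch under assumptions: the function names, the ES_URL variable, the document id 1, and the sample document are made up for illustration, and the query text is inserted without JSON escaping.

```shell
#!/usr/bin/env bash
ES_URL="${ES_URL:-http://127.0.0.1:9200}"   # assumed address of the test container

# Build a match query on the name field.
match_body() {
  printf '{"query":{"match":{"name":"%s"}}}' "$1"
}

# Index one sample document; refresh=wait_for makes it visible to the next search.
index_sample() {
  curl -sS -X PUT "$ES_URL/my_index/my_type/1?refresh=wait_for" \
    -H 'Content-Type: application/json' \
    -d '{"id":1,"name":"中华人民共和国国徽"}'
}

# Search the name field of my_index for the given text.
search_name() {
  curl -sS -X POST "$ES_URL/my_index/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(match_body "$1")"
}

# Example session:
#   index_sample
#   search_name "zhong"    # pinyin query
#   search_name "国徽"      # Chinese query
```

If the name field is analyzed with ik_smart_pinyin as mapped above, both queries would be expected to return the sample document. An empty result is a hint that the field was indexed with a different analyzer.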
Last modification:October 17th, 2020 at 04:59 pm