Practice 003: Elasticsearch analyzers

[toc]


I. Components of an Elasticsearch analyzer

1. The three building blocks

1.1 Character Filter

Filters the raw text before tokenization. For example, if the original text is HTML, the HTML tags need to be stripped out: html_strip.

1.2 Tokenizer

Splits the input (the text produced by the character filters) into tokens according to some rule, for example on whitespace.

1.3 Token Filter

Post-processes the candidate terms produced by the tokenizer, for example lowercasing (uppercase -> lowercase) and stop-word filtering (removing words such as in, the, etc.).
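
To make the pipeline concrete, the minimal _analyze request below (a sketch; the sample text is made up) chains all three stages in one call: an html_strip character filter, the standard tokenizer, and the lowercase token filter.

GET _analyze
{
 "char_filter": ["html_strip"],
 "tokenizer": "standard",
 "filter": ["lowercase"],
 "text": "<p>Hello WORLD</p>"
}

The expected tokens are hello and world: the tags are removed first, the text is then split into words, and finally each word is lowercased.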

II. Testing tokenization with analyzers

2.1 Testing with a specified analyzer

2.1.1 standard analyzer

  • Tokenizer: Standard Tokenizer

    Splits text based on Unicode text segmentation; suitable for most languages
  • Token Filter: Lower Case Token Filter / Stop Token Filter (disabled by default)

    • Lower Case Token Filter: terms are lowercased --> so with standard, the indexed tokens are lowercase and search matching is done against lowercase terms
    • Stop Token Filter (disabled by default) --> stop words: words that are dropped from the index after tokenization
GET _analyze
{
 "analyzer": "standard",
 "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

2.1.2 Observations on the standard result

  • everything is lowercased
  • the numbers are kept
  • no stop words are removed (the stop filter is disabled by default)
{
 "tokens" : [
 {
 "token" : "for",
 "start_offset" : 3,
 "end_offset" : 6,
 "type" : "<ALPHANUM>",
 "position" : 0
 },
 {
 "token" : "example",
 "start_offset" : 7,
 "end_offset" : 14,
 "type" : "<ALPHANUM>",
 "position" : 1
 },
 {
 "token" : "uuu",
 "start_offset" : 16,
 "end_offset" : 19,
 "type" : "<ALPHANUM>",
 "position" : 2
 },
 {
 "token" : "you",
 "start_offset" : 20,
 "end_offset" : 23,
 "type" : "<ALPHANUM>",
 "position" : 3
 },
 {
 "token" : "can",
 "start_offset" : 24,
 "end_offset" : 27,
 "type" : "<ALPHANUM>",
 "position" : 4
 },
 {
 "token" : "see",
 "start_offset" : 28,
 "end_offset" : 31,
 "type" : "<ALPHANUM>",
 "position" : 5
 },
 {
 "token" : "27",
 "start_offset" : 32,
 "end_offset" : 34,
 "type" : "<NUM>",
 "position" : 6
 },
 {
 "token" : "accounts",
 "start_offset" : 35,
 "end_offset" : 43,
 "type" : "<ALPHANUM>",
 "position" : 7
 },
 {
 "token" : "in",
 "start_offset" : 44,
 "end_offset" : 46,
 "type" : "<ALPHANUM>",
 "position" : 8
 },
 {
 "token" : "id",
 "start_offset" : 47,
 "end_offset" : 49,
 "type" : "<ALPHANUM>",
 "position" : 9
 },
 {
 "token" : "idaho",
 "start_offset" : 51,
 "end_offset" : 56,
 "type" : "<ALPHANUM>",
 "position" : 10
 }
 ]
}

2.2 Other analyzers

  • standard
  • stop: removes stop words
  • simple
  • whitespace: splits on whitespace only, removes nothing
  • keyword: keeps the full text as a single token, no tokenization (see the sketch below, which contrasts whitespace and keyword on the same text)
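
A quick sketch (any sample sentence works):

GET _analyze
{
 "analyzer": "whitespace",
 "text": "The QUICK brown-foxes JUMPED."
}

GET _analyze
{
 "analyzer": "keyword",
 "text": "The QUICK brown-foxes JUMPED."
}

With whitespace the text is only split on spaces, so tokens such as "The" and "JUMPED." keep their case and punctuation; with keyword the entire sentence comes back as a single token.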

2.3 Testing with a specified tokenizer and token filter

2.3.1 Using the same tokenizer and filter as standard

The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer parameter with an explicit tokenizer and filter and try it:

GET _analyze
{
 "tokenizer": "standard",
 "filter": ["lowercase"], 
 "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is identical to the one above:

{
 "tokens" : [
 {
 "token" : "for",
 "start_offset" : 3,
 "end_offset" : 6,
 "type" : "<ALPHANUM>",
 "position" : 0
 },
 {
 "token" : "example",
 "start_offset" : 7,
 "end_offset" : 14,
 "type" : "<ALPHANUM>",
 "position" : 1
 },
 {
 "token" : "uuu",
 "start_offset" : 16,
 "end_offset" : 19,
 "type" : "<ALPHANUM>",
 "position" : 2
 },
 {
 "token" : "you",
 "start_offset" : 20,
 "end_offset" : 23,
 "type" : "<ALPHANUM>",
 "position" : 3
 },
 {
 "token" : "can",
 "start_offset" : 24,
 "end_offset" : 27,
 "type" : "<ALPHANUM>",
 "position" : 4
 },
 {
 "token" : "see",
 "start_offset" : 28,
 "end_offset" : 31,
 "type" : "<ALPHANUM>",
 "position" : 5
 },
 {
 "token" : "27",
 "start_offset" : 32,
 "end_offset" : 34,
 "type" : "<NUM>",
 "position" : 6
 },
 {
 "token" : "accounts",
 "start_offset" : 35,
 "end_offset" : 43,
 "type" : "<ALPHANUM>",
 "position" : 7
 },
 {
 "token" : "in",
 "start_offset" : 44,
 "end_offset" : 46,
 "type" : "<ALPHANUM>",
 "position" : 8
 },
 {
 "token" : "id",
 "start_offset" : 47,
 "end_offset" : 49,
 "type" : "<ALPHANUM>",
 "position" : 9
 },
 {
 "token" : "idaho",
 "start_offset" : 51,
 "end_offset" : 56,
 "type" : "<ALPHANUM>",
 "position" : 10
 }
 ]
}

2.3.2 Adding a stop filter

GET _analyze
{
 "tokenizer": "standard",
 "filter": ["lowercase","stop"], 
 "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Observation: "in" is gone, so the stop filter's word list evidently contains "in".

The filter parameter now holds two token filters (like many Elasticsearch parameters, it accepts an array of values). If you remove lowercase from the list, uppercase letters are no longer converted to lowercase; that result is not shown here.

{
 "tokens" : [
 {
 "token" : "example",
 "start_offset" : 7,
 "end_offset" : 14,
 "type" : "<ALPHANUM>",
 "position" : 1
 },
 {
 "token" : "uuu",
 "start_offset" : 16,
 "end_offset" : 19,
 "type" : "<ALPHANUM>",
 "position" : 2
 },
 {
 "token" : "you",
 "start_offset" : 20,
 "end_offset" : 23,
 "type" : "<ALPHANUM>",
 "position" : 3
 },
 {
 "token" : "can",
 "start_offset" : 24,
 "end_offset" : 27,
 "type" : "<ALPHANUM>",
 "position" : 4
 },
 {
 "token" : "see",
 "start_offset" : 28,
 "end_offset" : 31,
 "type" : "<ALPHANUM>",
 "position" : 5
 },
 {
 "token" : "27",
 "start_offset" : 32,
 "end_offset" : 34,
 "type" : "<NUM>",
 "position" : 6
 },
 {
 "token" : "accounts",
 "start_offset" : 35,
 "end_offset" : 43,
 "type" : "<ALPHANUM>",
 "position" : 7
 },
 {
 "token" : "id",
 "start_offset" : 47,
 "end_offset" : 49,
 "type" : "<ALPHANUM>",
 "position" : 9
 },
 {
 "token" : "idaho",
 "start_offset" : 51,
 "end_offset" : 56,
 "type" : "<ALPHANUM>",
 "position" : 10
 }
 ]
}

III. Elasticsearch's built-in analyzer components

3.1 Built-in character filters

3.1.1 What is a character filter?

It processes the text before the tokenizer runs, for example adding, deleting or replacing characters; multiple character filters can be configured.

It affects the position and offset information reported by the tokenizer.

3.1.2 Some built-in character filters

  • html_strip: strips HTML tags
  • mapping: string replacement
  • pattern_replace: regex-based replacement

3.2 Built-in tokenizers

3.2.1 What is a tokenizer?

It splits the raw text (after character-filter processing) into terms/tokens according to certain rules.

3.2.2 Built-in tokenizers

  • whitespace: splits on whitespace
  • standard
  • uax_url_email: keeps URLs and email addresses as single tokens (see the sketch after this list)
  • pattern: regex-based splitting
  • keyword: no splitting
  • path_hierarchy: splits path names into hierarchy levels
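
Since uax_url_email is not covered by the later demos, here is a minimal sketch (the sample text is made up):

GET _analyze
{
 "tokenizer": "uax_url_email",
 "text": "mail me at john@example.com or visit https://www.elastic.co"
}

Unlike standard, which would break the address and the URL into several pieces, uax_url_email keeps john@example.com and https://www.elastic.co each as a single token (typed <EMAIL> and <URL>).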

3.2.3 A custom tokenizer can also be implemented as a Java plugin

3.3 Built-in token filters

3.3.1 What is a token filter?

It post-processes the terms output by the tokenizer.

3.3.2 Built-in token filters

  • lowercase: lowercases terms
  • stop: removes stop words (in, the, etc.)
  • synonym: adds synonyms (see the sketch after this list)
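
The synonym filter does not appear in the demos below either, so here is a minimal sketch using an inline synonym definition (the synonym pair is made up):

GET _analyze
{
 "tokenizer": "standard",
 "filter": [
 "lowercase",
 { "type": "synonym", "synonyms": ["quick, fast"] }
 ],
 "text": "a quick fox"
}

Besides a, quick and fox, the output should also contain fast at the same position as quick, which is what makes synonym matching work at query time.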

IV. Demo cases

4.1 html_strip / mapping + keyword

GET _analyze
{
 "tokenizer": "keyword",
 "char_filter": [
 {
 "type": "html_strip"
 },
 {
 "type": "mapping",
 "mappings": [
 "- => _", ":) => _happy_", ":( => _sad_"
 ]
 }
 ],
 "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}

The keyword tokenizer is used, so the text is kept whole and not split.

Two char_filters are used: html_strip (removes the HTML tags) and mapping (replaces the matched content with the specified replacement).

Result of the request above: the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons became _happy_ / _sad_. Note that the offsets still point into the original text: the token starts at offset 3 because of the stripped <b> tag.

{
 "tokens" : [
 {
 "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
 "start_offset" : 3,
 "end_offset" : 52,
 "type" : "word",
 "position" : 0
 }
 ]
}

4.2 Regex replacement with a char_filter

GET _analyze
{
 "tokenizer": "standard",
 "char_filter": [
 {
 "type": "pattern_replace",
 "pattern": "http://(.*)",
 "replacement": "$1"
 }
 ],
 "text": "http://www.elastic.co"
}

The regex replacement is configured with type / pattern / replacement.

Result:

{
 "tokens" : [
 {
 "token" : "www.elastic.co",
 "start_offset" : 0,
 "end_offset" : 21,
 "type" : "<ALPHANUM>",
 "position" : 0
 }
 ]
}

4.3 Splitting paths with the path_hierarchy tokenizer

GET _analyze
{
 "tokenizer": "path_hierarchy",
 "text": "/user/niewj/a/b/c"
}

Tokenization result:

{
 "tokens" : [
 {
 "token" : "/user",
 "start_offset" : 0,
 "end_offset" : 5,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "/user/niewj",
 "start_offset" : 0,
 "end_offset" : 11,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "/user/niewj/a",
 "start_offset" : 0,
 "end_offset" : 13,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "/user/niewj/a/b",
 "start_offset" : 0,
 "end_offset" : 15,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "/user/niewj/a/b/c",
 "start_offset" : 0,
 "end_offset" : 17,
 "type" : "word",
 "position" : 0
 }
 ]
}

4.4 whitespace tokenizer with the stop token filter

GET _analyze
{
 "tokenizer": "whitespace",
 "filter": ["stop"], // ["lowercase", "stop"]
 "text": "The girls in China are playing this game !"
}

Result: "in" and "this" are removed (stop words), but "The" keeps its capital letter and survives, because only the stop filter is applied here: without a lowercase filter, "The" never matches the lowercase stop word "the" (the standard analyzer would have lowercased it first). See the sketch after the result below.

{
 "tokens" : [
 {
 "token" : "The",
 "start_offset" : 0,
 "end_offset" : 3,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "girls",
 "start_offset" : 4,
 "end_offset" : 9,
 "type" : "word",
 "position" : 1
 },
 {
 "token" : "China",
 "start_offset" : 13,
 "end_offset" : 18,
 "type" : "word",
 "position" : 3
 },
 {
 "token" : "playing",
 "start_offset" : 23,
 "end_offset" : 30,
 "type" : "word",
 "position" : 5
 },
 {
 "token" : "game",
 "start_offset" : 36,
 "end_offset" : 40,
 "type" : "word",
 "position" : 7
 },
 {
 "token" : "!",
 "start_offset" : 41,
 "end_offset" : 42,
 "type" : "word",
 "position" : 8
 }
 ]
}
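
As the commented-out alternative in the request above suggests, adding lowercase in front of stop changes the picture: "The" is lowercased to "the" first and then removed by the stop filter. A quick sketch:

GET _analyze
{
 "tokenizer": "whitespace",
 "filter": ["lowercase", "stop"],
 "text": "The girls in China are playing this game !"
}

The remaining tokens should be girls, china, playing, game and "!".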

4.5 Custom analyzers

4.5.1 Defining a custom analyzer in the index settings

PUT my_new_index
{
 "settings": {
 "analysis": {
 "analyzer": {
 "my_analyzer":{ // 1.自定义analyzer的名称
 "type": "custom",
 "char_filter": ["my_emoticons"], 
 "tokenizer": "my_punctuation", 
 "filter": ["lowercase", "my_english_stop"]
 }
 },
 "tokenizer": {
 "my_punctuation": { // 3.自定义tokenizer的名称
 "type": "pattern", "pattern":"[ .,!?]"
 }
 },
 "char_filter": {
 "my_emoticons": { // 2.自定义char_filter的名称
 "type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"]
 }
 },
 "filter": {
 "my_english_stop": { // 4.自定义token filter的名称
 "type": "stop", "stopwords": "_english_"
 }
 }
 }
 }
}

4.5.2 Testing the custom analyzer:

POST my_new_index/_analyze
{
 "analyzer": "my_analyzer",
 "text": "I'm a :) person in the earth, :( And You? "
}

Output:

{
 "tokens" : [
 {
 "token" : "i'm",
 "start_offset" : 0,
 "end_offset" : 3,
 "type" : "word",
 "position" : 0
 },
 {
 "token" : "_hapy_",
 "start_offset" : 6,
 "end_offset" : 8,
 "type" : "word",
 "position" : 2
 },
 {
 "token" : "person",
 "start_offset" : 9,
 "end_offset" : 15,
 "type" : "word",
 "position" : 3
 },
 {
 "token" : "earth",
 "start_offset" : 23,
 "end_offset" : 28,
 "type" : "word",
 "position" : 6
 },
 {
 "token" : "_sad_",
 "start_offset" : 30,
 "end_offset" : 32,
 "type" : "word",
 "position" : 7
 },
 {
 "token" : "you",
 "start_offset" : 37,
 "end_offset" : 40,
 "type" : "word",
 "position" : 9
 }
 ]
}
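
In practice, the custom analyzer is then referenced from a field mapping so that documents indexed into that field go through it. A minimal sketch, assuming a hypothetical text field called title:

PUT my_new_index/_mapping
{
 "properties": {
 "title": { "type": "text", "analyzer": "my_analyzer" }
 }
}
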
Author: niewj. Original article: https://segmentfault.com/a/1190000041764931
