
The jieba plugin emits tokens whose value is a space when tokenizing strings that contain spaces #17

Open
zh6335901 opened this issue Sep 8, 2016 · 4 comments

Comments


zh6335901 commented Sep 8, 2016

Tokenizing a string that contains spaces with the jieba plugin produces tokens whose value is a space, in both search and index mode. For example:

curl "http://localhost:9200/test/_analyze?text=你好%20北京&analyzer=jieba_search&pretty"
{
  "tokens": [
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": " ",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "北京",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    }
  ]
}

So if a user's search query contains spaces, the search results may be affected: the search-time tokens include spaces, but the indexed content may not contain them.

@zh6335901
Author

I tried the trim and stop filters, but neither was able to remove the spaces.
My workaround is to filter out tokens whose value is a space in the incrementToken method of the JiebaTokenFilter class; it works in my tests.
However, since I'm not very familiar with the ES plugin mechanism or with Java, I'm not sure whether this is a good approach.
If it is viable, I'll open a pull request; if there is a better way, please let me know.
Thanks!
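
For reference, this is roughly what I have in mind (a minimal sketch only, not the plugin's actual JiebaTokenFilter code; the class name is made up and the standard Lucene TokenFilter / CharTermAttribute API is assumed):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Illustrative filter that drops tokens consisting only of whitespace.
    public final class SkipBlankTokenFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public SkipBlankTokenFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Keep consuming tokens until one that is not all whitespace is found.
            while (input.incrementToken()) {
                if (!termAtt.toString().trim().isEmpty()) {
                    return true;
                }
            }
            return false; // end of the token stream
        }
    }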

@Steven-Z-Yang

In my reply on the synonyms issue I used the whitespace tokenizer, so the spaces were all filtered out.


tsaiian commented Sep 25, 2020

I used trim and then removed empty strings (remove_empty):

    "analysis": {
      "analyzer": {
        "norm_jieba_index": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "trim",
            "remove_empty"
          ]
        },
        "norm_jieba_search": {
          "tokenizer": "jieba_search",
          "filter": [
            "lowercase",
            "trim",
            "remove_empty"
          ]
        }
      },
      "filter": {
        "remove_empty": {
          "type": "stop",
          "stopwords": [""]
        }
      }
    }
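
With settings like these, an _analyze request against the custom analyzer should no longer return the space token. The following is just an illustration: it assumes an index named test created with the settings above, and uses the JSON-body form of _analyze that newer Elasticsearch versions expect (the older query-parameter form used at the top of this issue has since been removed):

    curl -H 'Content-Type: application/json' "http://localhost:9200/test/_analyze?pretty" -d '
    {
      "analyzer": "norm_jieba_search",
      "text": "你好 北京"
    }'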

@silencesimon

(quoting @tsaiian's trim + remove_empty configuration above)

I followed this and it works. Thanks!
