Mmseg Analysis for ElasticSearch
==========
The Mmseg Analysis plugin integrates Lucene mmseg4j-analyzer:http://code.google.com/p/mmseg4j/ into elasticsearch, support customized dictionary.
Current MMSeg4j version v1.8.5
The plugin includes a `mmseg` analyzer and a `mmseg` tokenizer.
Version
-—————
master | 0.19.11 → master
1.1.1 | 0.19.11 → master
1.1.0 | 0.19.4 → master
1.0.0 | 0.16.2 → 0.19.3
Install
-—————
bin/plugin -install weiweiwang/elasticsearch-analysis-mmseg/1.1.1
also download the dict files,unzip these dict file to your elasticsearch’s config folder,such as: your-es-root/config/mmseg
cd config wget http://github.com/downloads/weiweiwang/elasticsearch-analysis-mmseg/mmseg.zip --no-check-certificate unzip mmseg.zip rm mmseg.zip
you need a service restart after that!
Analysis Configuration (elasticsearch.yml)
-—————
index: analysis: analyzer: mmseg: alias: [news_analyzer, mmseg_analyzer] type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider index.analysis.analyzer.default.type : "mmseg"
additional parameters that can be used to customize the mmseg tokenizer
index: analysis: tokenizer: mmseg_maxword: type: mmseg seg_type: "max_word" mmseg_complex: type: mmseg seg_type: "complex" mmseg_simple: type: mmseg seg_type: "simple"
Mapping Configuration
-—————
Here is a quick example:
1.create a index
curl -XPUT http://localhost:9200/index
2.create a mapping
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d' { "fulltext": { "_all": { "indexAnalyzer": "mmseg", "searchAnalyzer": "mmseg", "term_vector": "no", "store": "false" }, "properties": { "content": { "type": "string", "store": "no", "term_vector": "with_positions_offsets", "indexAnalyzer": "mmseg", "searchAnalyzer": "mmseg", "include_in_all": "true", "boost": 8 } } } }'
3.indexing some docs
curl -XPOST http://localhost:9200/index/fulltext/1 -d' {content:"美国留给伊拉克的是个烂摊子吗"} ' curl -XPOST http://localhost:9200/index/fulltext/2 -d' {content:"公安部:各地校车将享最高路权"} ' curl -XPOST http://localhost:9200/index/fulltext/3 -d' {content:"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"} ' curl -XPOST http://localhost:9200/index/fulltext/4 -d' {content:"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"} '
4.query with highlighting
curl -XPOST http://localhost:9200/index/fulltext/_search -d' { "query" : { "term" : { "content" : "中国" }}, "highlight" : { "pre_tags" : ["<tag1>", "<tag2>"], "post_tags" : ["</tag1>", "</tag2>"], "fields" : { "content" : {} } } } '
here is the query result
{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 2, "hits": [ { "_index": "index", "_type": "fulltext", "_id": "4", "_score": 2, "_source": { "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首" }, "highlight": { "content": [ "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 " ] } }, { "_index": "index", "_type": "fulltext", "_id": "3", "_score": 2, "_source": { "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船" }, "highlight": { "content": [ "均每天扣1艘<tag1>中国</tag1>渔船 " ] } } ] } }
Mmseg4j dict generation
-—————
see http://weiweiwang.github.com/elasticsearch-configuration-and-performance-tuning.html
have fun.