
elasticsearch-analysis-ansj's Introduction

elasticsearch-analysis-ansj: a Chinese analysis plugin for Elasticsearch

Overview

elasticsearch-analysis-ansj is a Chinese analysis plugin for Elasticsearch based on the ansj word segmentation algorithm.

Build

mvn package

After a successful build, the packaged plugin archive is generated at target/releases/elasticsearch-analysis-ansj-<version>-release.zip.

Installation

Install command

Run the following command from the Elasticsearch installation directory to install the plugin:

./bin/elasticsearch-plugin install file:///<your-path>/elasticsearch-analysis-ansj-<version>-release.zip

After installation, a default configuration file is generated at <ES_HOME>/config/elasticsearch-analysis-ansj/ansj.cfg.yml; edit it as needed.

Testing

After installation, start the Elasticsearch cluster. Verify that the plugin is installed correctly in either of the following ways:
Method 1:
Run GET /_cat/ansj?text=**&type=index_ansj in Kibana to test the index_ansj analyzer. The response looks like this:

{
  "result": [
    {
      "name": "**",
      "nature": "ns",
      "offe": 0,
      "realName": "**",
      "synonyms": null
    },
    {
      "name": "",
      "nature": "f",
      "offe": 0,
      "realName": "",
      "synonyms": null
    },
    {
      "name": "",
      "nature": "n",
      "offe": 1,
      "realName": "",
      "synonyms": null
    }
  ]
}
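
The same endpoint can be pointed at the other analyzers by changing the type parameter. For example (the sample text below is our own, and we assume type accepts any of the three analyzer names):

GET /_cat/ansj?text=中华人民共和国&type=query_ansj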

Method 2:
Run GET /_cat/ansj/config in Kibana to retrieve the configuration. The response looks like this:

{
  "ambiguity": [
    "ambiguity"
  ],
  "stop": [
    "stop"
  ],
  "synonyms": [
    "synonyms"
  ],
  "crf": [
    "crf"
  ],
  "isQuantifierRecognition": "true",
  "isRealName": "false",
  "isNumRecognition": "true",
  "isNameRecognition": "true",
  "dic": [
    "dic"
  ]
}
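
If Kibana is not available, the same checks can be run with curl against any node's HTTP port (localhost:9200 is an assumption; substitute your own host and port, and your own sample text):

curl -G 'http://localhost:9200/_cat/ansj' --data-urlencode 'text=中华人民共和国' --data-urlencode 'type=index_ansj'
curl -XGET 'http://localhost:9200/_cat/ansj/config'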

Usage

  • Step 1: create an index
PUT /test_index?pretty
{
  "settings" : {
    "index" : {
      "number_of_shards" : 16,
      "number_of_replicas" : 1,
      "refresh_interval":"5s"
    }
  },
  "mappings" : {
    "properties" : {
      "test_field": { 
        "type": "text",
        "analyzer": "index_ansj",
        "search_analyzer": "query_ansj"
      }
    }
  }
}

Notes:

  • test_index: the name of the index used for testing;
  • test_field: the field used for testing;
  • the field's index-time analyzer is set to index_ansj;
  • the field's search-time analyzer is set to query_ansj.

Verify that the index is configured correctly (an alternative form using an explicit analyzer name is shown after step 3):

POST /test_index/_analyze
{
  "field": "test_field",
  "text": "**"
}
  • Step 2: add data
PUT test_index/_bulk?refresh
{"create":{ }}
{ "test_field" : "**" }
{"create":{ }}
{ "test_field" : "中华人民共和国" }
{"create":{ }}
{ "test_field" : "**有56个民族" }
{"create":{ }}
{ "test_field" : "**是社会主义国家" }
  • Step 3: run a search
GET test_index/_search
{
  "query": {
    "match": {
      "test_field": {
        "query": "**"
      }
    }
  }
}
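
To compare how the index-time and search-time analyzers treat the same text, _analyze can also be called with an explicit analyzer name instead of a field (the sample text below is our own):

POST /test_index/_analyze
{
  "analyzer": "index_ansj",
  "text": "中华人民共和国"
}

POST /test_index/_analyze
{
  "analyzer": "query_ansj",
  "text": "中华人民共和国"
}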

Note:

  • the requests above are executed in Kibana Dev Tools;
  • they have only been tested against Elasticsearch 8.x; adjust them as needed for other versions.

Plugin features

After installing the plugin, the following components become available in the Elasticsearch cluster:

Three analyzers:

  • index_ansj (recommended for indexing)
  • query_ansj (recommended for search)
  • dic_ansj

Three tokenizers:

  • index_ansj (recommended for indexing)
  • query_ansj (recommended for search)
  • dic_ansj
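
Because the tokenizers are registered as well, they can be combined with standard token filters in a custom analyzer defined in index settings. A minimal sketch (the index name, analyzer name and filter choice are our own):

PUT /my_ansj_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ansj": {
          "type": "custom",
          "tokenizer": "index_ansj",
          "filter": ["lowercase"]
        }
      }
    }
  }
}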

HTTP endpoints:

  • /_cat/ansj: analyze a piece of text
  • /_cat/ansj/config: show the full configuration
  • /_ansj/flush/config: reload the full configuration
  • /_ansj/flush/dic: reload all dictionaries, including the user dictionary, stop-word dictionary, synonym dictionary, ambiguity dictionary, and CRF model
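
A short Kibana Dev Tools session exercising these endpoints might look like the following. The README only lists the paths, so the HTTP verbs for the two flush endpoints are an assumption; adjust them if your version expects a different method:

GET /_cat/ansj?text=**&type=index_ansj
GET /_cat/ansj/config
# reload the configuration (verb assumed)
GET /_ansj/flush/config
# reload all dictionaries (verb assumed)
GET /_ansj/flush/dic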

Configuration file

Configuration file format

ansj:
  # default parameters
  isNameRecognition: true # enable person-name recognition
  isNumRecognition: true # enable number recognition
  isQuantifierRecognition: true # merge numbers and quantifiers
  isRealName: false # keep the original (real) form of words; keeping this false is recommended

  # user dictionary configuration
  #dic: default.dic # can also be written as file://default.dic; if dic is not configured, this dictionary is loaded by default
  # load over HTTP
  #dic_d1: http://xxx/xx.dic
  # load from a file inside a jar
  #dic_d2: jar://org.ansj.dic.DicReader|/dic2.dic
  # load from a database
  #dic_d3: jdbc://jdbc:mysql://xxxx:3306/ttt?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull|username|password|select name as name,nature,freq from dic where type=1
  # load from a custom class: YourClass extends PathToStream
  #dic_d4: class://xxx.xxx.YourClass|otherparam

  # stop-word (filter) dictionary configuration
  #stop: ... # http, file, jar, class and jdbc sources are all supported
  #stop_key1: ...

  # ambiguity dictionary configuration
  #ambiguity: ... # http, file, jar, class and jdbc sources are all supported
  #ambiguity_key1: ...

  # synonym dictionary configuration
  #synonyms: ... # http, file, jar, class and jdbc sources are all supported
  #synonyms_key1: ...
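
The on-disk dictionary format is not spelled out in the README. Based on the jdbc example above (name, nature, freq) and ansj's conventions, a user dictionary entry is one word per line, optionally followed by a part-of-speech tag and a frequency, separated by tabs. Treat the lines below as an illustrative sketch rather than an authoritative specification:

六味地黄丸	n	1000
动漫游戏	n	1000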

Configuration examples

Loading dictionaries from local files

ansj:
  # enable person-name recognition
  isNameRecognition: false
  # enable number recognition
  isNumRecognition: true
  # merge numbers and quantifiers
  isQuantifierRecognition: false
  # keep the original (real) form of words
  isRealName: false
  # user dictionary
  dic: file:///data/elasticsearch-dic/ansj/main.dic
  # stop-word (filter) dictionary
  stop: file:///data/elasticsearch-dic/ansj/stop.dic
  # ambiguity dictionary
  ambiguity: file:///data/elasticsearch-dic/ansj/ambiguity.dic
  # synonym dictionary
  synonyms: file:///data/elasticsearch-dic/ansj/synonyms.dic

Loading dictionaries over HTTP

ansj:
  # enable person-name recognition
  isNameRecognition: false
  # enable number recognition
  isNumRecognition: true
  # merge numbers and quantifiers
  isQuantifierRecognition: false
  # keep the original (real) form of words
  isRealName: false
  # user dictionary
  dic: http://example.com/elasticsearch-dic/ansj/main.dic
  # stop-word (filter) dictionary
  stop: http://example.com/elasticsearch-dic/ansj/stop.dic
  # ambiguity dictionary
  ambiguity: http://example.com/elasticsearch-dic/ansj/ambiguity.dic
  # synonym dictionary
  synonyms: http://example.com/elasticsearch-dic/ansj/synonyms.dic
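
Sources can also be mixed in a single configuration, since each key independently accepts any of the supported schemes; for example, a local main dictionary combined with stop words pulled over HTTP (the path and URL below are placeholders):

ansj:
  dic: file:///data/elasticsearch-dic/ansj/main.dic
  stop: http://example.com/elasticsearch-dic/ansj/stop.dic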

Plugin versions and corresponding Elasticsearch versions

Plugin version    Elasticsearch version
1.0.0 0.90.2
1.x 1.x
2.1.1 2.1.1
2.3.1 2.3.1
2.3.2 2.3.2
2.3.3 2.3.3
2.3.4 2.3.4
2.3.5 2.3.5
2.4.0 2.4.0
2.4.1 2.4.1
2.4.2 2.4.2
2.4.3 2.4.3
2.4.4 2.4.4
2.4.5 2.4.5
2.4.6 2.4.6
5.0.0 5.0.0
5.0.1 5.0.1
5.0.2 5.0.2
5.1.1 5.1.1
5.1.2 5.1.2
5.2.0 5.2.0
5.2.1 5.2.1
5.2.2 5.2.2
5.3.0 5.3.0
5.3.1 5.3.1
5.3.2 5.3.2
5.3.3 5.3.3
5.4.0 5.4.0
5.4.1 5.4.1
5.4.2 5.4.2
5.4.3 5.4.3
5.5.0 5.5.0
5.5.1 5.5.1
5.5.2 5.5.2
5.5.3 5.5.3
5.6.0 5.6.0
5.6.1 5.6.1
5.6.2 5.6.2
5.6.3 5.6.3
5.6.4 5.6.4
5.6.5 5.6.5
5.6.6 5.6.6
5.6.7 5.6.7
5.6.8 5.6.8
5.6.9 5.6.9
5.6.10 5.6.10
5.6.11 5.6.11
5.6.12 5.6.12
5.6.13 5.6.13
5.6.14 5.6.14
5.6.15 5.6.15
5.6.16 5.6.16
6.0.0 6.0.0
6.0.1 6.0.1
6.1.0 6.1.0
6.1.1 6.1.1
6.1.2 6.1.2
6.1.3 6.1.3
6.1.4 6.1.4
6.2.0 6.2.0
6.2.1 6.2.1
6.2.2 6.2.2
6.2.3 6.2.3
6.2.4 6.2.4
6.3.0 6.3.0
6.3.1 6.3.1
6.3.2 6.3.2
6.4.0 6.4.0
6.4.1 6.4.1
6.4.2 6.4.2
6.4.3 6.4.3
6.5.0 6.5.0
6.5.1 6.5.1
6.5.2 6.5.2
6.5.3 6.5.3
6.5.4 6.5.4
6.6.0 6.6.0
6.6.1 6.6.1
6.6.2 6.6.2
6.7.0 6.7.0
6.7.1 6.7.1
6.7.2 6.7.2
6.8.0 6.8.0
6.8.1 6.8.1
6.8.2 6.8.2
6.8.3 6.8.3
6.8.4 6.8.4
6.8.5 6.8.5
6.8.6 6.8.6
6.8.7 6.8.7
6.8.8 6.8.8
6.8.9 6.8.9
6.8.10 6.8.10
6.8.11 6.8.11
6.8.12 6.8.12
6.8.13 6.8.13
6.8.14 6.8.14
6.8.15 6.8.15
6.8.16 6.8.16
6.8.17 6.8.17
6.8.18 6.8.18
6.8.19 6.8.19
6.8.20 6.8.20
6.8.21 6.8.21
6.8.22 6.8.22
6.8.23 6.8.23
7.0.0 7.0.0
7.0.1 7.0.1
7.1.0 7.1.0
7.1.1 7.1.1
7.2.0 7.2.0
7.2.1 7.2.1
7.3.0 7.3.0
7.3.1 7.3.1
7.3.2 7.3.2
7.4.0 7.4.0
7.4.1 7.4.1
7.4.2 7.4.2
7.5.0 7.5.0
7.5.1 7.5.1
7.5.2 7.5.2
7.6.0 7.6.0
7.6.1 7.6.1
7.6.2 7.6.2
7.7.0 7.7.0
7.7.1 7.7.1
7.8.0 7.8.0
7.8.1 7.8.1
7.9.0 7.9.0
7.9.1 7.9.1
7.9.2 7.9.2
7.9.3 7.9.3
7.17.5 7.17.5
7.17.7 7.17.7
7.17.8 7.17.8
7.17.9 7.17.9
7.17.10 7.17.10
7.17.11 7.17.11
7.17.12 7.17.12
7.17.13 7.17.13
7.17.14 7.17.14
7.17.15 7.17.15
7.17.16 7.17.16
8.3.3 8.3.3
8.5.3 8.5.3
8.6.0 8.6.0
8.6.1 8.6.1
8.6.2 8.6.2
8.7.0 8.7.0
8.7.1 8.7.1
8.8.0 8.8.0
8.8.1 8.8.1
8.8.2 8.8.2
8.9.0 8.9.0
8.9.1 8.9.1
8.9.2 8.9.2
8.10.0 8.10.0
8.10.1 8.10.1
8.10.2 8.10.2
8.10.3 8.10.3
8.10.4 8.10.4
8.11.0 8.11.0
8.11.1 8.11.1
8.11.2 8.11.2
8.11.3 8.11.3

License

elasticsearch-analysis-ansj is licensed under the Apache License Version 2.0. See the LICENSE file for details.

elasticsearch-analysis-ansj's People

Contributors

4eversm, ansjsun, clyuz, defp, dependabot[bot], fossabot, hanbj, shi-yuan, timzaak, yyhan


elasticsearch-analysis-ansj's Issues

Integrating ansj with Elasticsearch 2.1.1

After installing ansj on Elasticsearch 2.1.1, why does highlighting still show the text split into single characters? And why is there no stop-word dictionary?

Configuration file for updating dictionaries via Redis

Does the ansj dictionary configuration need to be written into elasticsearch.yml? I added it there, but it still reports that no Redis configuration was found:
[2016-09-14 18:16:24,400][INFO ][ansj-initializer ] 没有找到redis相关配置!

Publishing terms via Redis no longer works

Elasticsearch version: 2.3.3
Plugin version: 2.3.3.2

Starting after adding -Des.security.manager.enabled=false produces the following error:

redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:22)
    at org.ansj.elasticsearch.pubsub.redis.RedisUtils.getConnection(RedisUtils.java:22)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:85)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Could not create a validated object, cause: ValidateObject failed
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1203)
    at redis.clients.util.Pool.getResource(Pool.java:20)
    ... 3 more
[2016-07-02 21:23:23,838][ERROR][ansj-redis-utils         ] Could not get a resource from the pool
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:22)
    at org.ansj.elasticsearch.pubsub.redis.RedisUtils.getConnection(RedisUtils.java:22)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:85)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Could not create a validated object, cause: ValidateObject failed
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1203)
    at redis.clients.util.Pool.getResource(Pool.java:20)
    ... 3 more
[2016-07-02 21:23:23,839][INFO ][ansj-initializer         ] redis守护线程准备完毕,ip:127.0.0.1:6379,port:6379,channel:ansj_term
Exception in thread "Thread-3" java.lang.NullPointerException
    at java.util.Objects.requireNonNull(Objects.java:203)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:89)
    at java.lang.Thread.run(Thread.java:745)

Publishing through the channel produces no ext.dic file, and the added terms have no effect.

MD5 Hash字符串被分词

Elasticsearch 1.7.2
ansj 插件 1.x 版本

类型 string,未指定 not_analyze, 值 fd04fe9b5225461204e75837f1616575, 分词器 index_ansj

该值被分词结果如下:

{'tokens': [{'end_offset': 2,
'position': 1,
'start_offset': 0,
'token': 'fd',
'type': 'word'},
{'end_offset': 4,
'position': 2,
'start_offset': 2,
'token': '04',
'type': 'word'},
{'end_offset': 6,
'position': 3,
'start_offset': 4,
'token': 'fe',
'type': 'word'},
{'end_offset': 7,
'position': 4,
'start_offset': 6,
'token': '9',
'type': 'word'},
{'end_offset': 8,
'position': 5,
'start_offset': 7,
'token': 'b',
'type': 'word'},
{'end_offset': 18,
'position': 6,
'start_offset': 8,
'token': '5225461204',
'type': 'word'},
{'end_offset': 19,
'position': 7,
'start_offset': 18,
'token': 'e',
'type': 'word'},
{'end_offset': 24,
'position': 8,
'start_offset': 19,
'token': '75837',
'type': 'word'},
{'end_offset': 25,
'position': 9,
'start_offset': 24,
'token': 'f',
'type': 'word'},
{'end_offset': 32,
'position': 10,
'start_offset': 25,
'token': '1616575',
'type': 'word'}]}

What is the rationale behind this tokenization? Thanks!
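
This is not an answer from the maintainers, just a common workaround: identifiers such as MD5 hashes are usually kept intact by exempting the field from analysis entirely. On the 1.x/2.x mapping API that looks roughly like the sketch below (index, type and field names are placeholders); on Elasticsearch 5+ the equivalent is the keyword field type.

PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "md5": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}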

Error after changing plugin-descriptor to 5.3.0

  1. Could not find a suitable constructor in org.elasticsearch.rest.RestController. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
    at org.elasticsearch.rest.RestController.class(Unknown Source)
    while locating org.elasticsearch.rest.RestController
    for parameter 1 at org.ansj.elasticsearch.cat.AnsjCatAction.(Unknown Source)
    at unknown

2 errors
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:361) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.InjectorBuilder.initializeStatically(InjectorBuilder.java:137) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:93) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:43) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.node.Node.(Node.java:482) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.node.Node.(Node.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap$6.(Bootstrap.java:242) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:360) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.3.0.jar:5.3.0]
... 6 more

A segmentation problem

After adding some terms, a few problems appeared:

Segmenting 三打白骨精 with index_ansj:
[screenshot]
It does not segment at the finest granularity.

The number of results from forced segmentation also seems to be capped, which easily leads to array-index-out-of-bounds errors.

Can stop words be disabled for a specific index?

For example, with a configuration like the following?

PUT /my_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "default":{
            	"type": "search_ansj",
            	"enabled_stop_filter": false
            }
         }
      }
   }
}

Questions about using part-of-speech tags

  1. During index analysis, ansj segments the text and produces part-of-speech information. I would like to filter out certain parts of speech (just like filtering stop words). How can this be done?
  2. At query time, after the query is segmented each term carries a part-of-speech tag. Different parts of speech may deserve different term weights when ranking, so the part-of-speech information is needed during scoring. How can this be done?
    Thanks!

index_ansj works fine as an analyzer, but using it as a standalone tokenizer reports that the tokenizer does not exist.

curl -XGET 'localhost:9200/_analyze?tokenizer=index_ansj&filter=pinyin&pretty' -d '你好,我是小明的同学小强'
{
"error" : {
"root_cause" : [ {
"type" : "remote_transport_exception",
"reason" : "[node1][127.0.0.1:9300][indices:admin/analyze[s]]"
} ],
"type" : "null_pointer_exception",
"reason" : null
},
"status" : 500
}

Analysis with the analyzer works fine:
curl -XGET 'localhost:9200/_analyze?analyzer=index_ansj&pretty' -d '你好,我是小明的同学小强'
{
"tokens" : [ {
"token" : "你好",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : ",",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "我",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "是",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 3
}, {
"token" : "小明",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 5
}, {
"token" : "同学",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 6
}, {
"token" : "小强",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 7
} ]
}
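
Not from this thread, but possibly relevant: on Elasticsearch 5.x and later the tokenizer can be passed in the request body rather than as a URL parameter, which may behave differently from the URL form used above. A sketch (without the pinyin filter, which requires its own plugin):

GET /_analyze
{
  "tokenizer": "index_ansj",
  "text": "你好,我是小明的同学小强"
}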

Why are positions strictly increasing instead of corresponding one-to-one with the original text?

In the examples below, the left side is the input and the right side is the indexed terms. I would like every group of terms to share positions 0, 1, 2 so that a phrase search can use any combination, e.g. the terms at position 1 are [liu, l], position 2: [de, d], position 3: [hua, h].

  • 刘德华,liudehua=>[liu, de, hua]
  • ldh =>[l,d,h]
  • 刘德h=>[liu,de,h]
  • l德华=>[l,de,hua]
  • liudh => ['liu','d','h']

At the same time, I would like the indexed result to look like this:

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset":1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
]
}

elasticsearch-analysis-ansj counts spaces as tokens

http://localhost:9200/docs/_analyze?text=ce%20shi&analyzer=index_ansj
Testing shows that a space is also counted as a token; how can it be removed?
{"tokens":[{"token":"ce","start_offset":0,"end_offset":2,"type":"en","position":0},{"token":" ","start_offset":2,"end_offset":3,"type":"null","position":1},{"token":"shi","start_offset":3,"end_offset":6,"type":"en","position":2}]}

A few small configuration questions

If detailed documentation already exists, please point me to it; I have searched for a long time without finding any.

  1. Is there a reference for the commands used to push terms remotely via Redis?

  2. I want to use it to manage the default dictionary, the stop-word dictionary and so on; how should this be configured?

  3. How is the synonym dictionary configured?

  4. ansj keeps punctuation and even spaces, which is useful for terms such as .Net, C++ and C#, but how should the many useless symbols be handled? Should they be added to the stop-word dictionary as stop words?

  5. Apart from the ambiguity.dic found at http://maven.nlpcn.org/down/library/, I could not find examples of the other dictionaries. Could you tell me where to look?

Garbled highlighting

I am using version 2.3.4. Chinese articles containing various punctuation marks are highlighted with the fvh highlighter, but the highlights are always garbled and unrelated to the input keywords. The token positions in the index appear to be out of order; filtering out the punctuation did not help.

Single-character search problem

Hello, and thank you for developing this analysis plugin. I ran into a small problem while using it and would like to ask about it.
For example, I have two documents:
1. 雪野新村
2. 上南新村
Searching with a single character as the keyword returns nothing, yet searching with 雪野 finds document 1.
Searching with another single character finds document 2.
How are single-character searches handled? I do not quite understand the results above; is this related to segmentation?

Error while indexing on Elasticsearch 2.3.1

Creating the index works fine, but an error occurs as soon as data is indexed:

[2016-04-23 17:45:27,284][WARN ][rest.suppressed          ] /products/product/13 Params: {index=products, id=13, type=product}
RemoteTransportException[[Lightbright][127.0.0.1:9300][indices:data/write/index[p]]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at org.ansj.util.AnsjReader.<init>(AnsjReader.java:34)
    at org.ansj.util.AnsjReader.<init>(AnsjReader.java:49)
    at org.ansj.splitWord.analysis.IndexAnalysis.<init>(IndexAnalysis.java:133)
    at org.ansj.lucene5.AnsjAnalyzer.getTokenizer(AnsjAnalyzer.java:95)
    at org.ansj.elasticsearch.index.analysis.AnsjAnalysis$1.create(AnsjAnalysis.java:59)
    at org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(CustomAnalyzer.java:83)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
    at org.apache.lucene.document.Field.tokenStream(Field.java:562)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:530)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

My configuration file is as follows:

index:
  analysis:
    analyzer:
      customer_ansj_index:
        tokenizer: index_ansj
      customer_ansj_query:
        tokenizer: query_ansj

mvn clean install fails after cloning the project

As the title says, running mvn clean install after cloning the project fails with:
[ERROR] Failed to execute goal on project elasticsearch-analysis-ansj: Could not resolve dependencies for project org.ansj:elasticsearch-analysis-ansj:jar:2.1.1: The following artifacts could not be resolved: org.ansj:ansj_seg:jar:3.6, org.ansj:ansj_lucene5_plug:jar:3.0: Failure to find org.ansj:ansj_seg:jar:3.6 in http://repo1.maven.org/maven2/ was cached in the local repository, resolution will not be reattempted until the update interval of repo1 has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Add a dependency

After upgrading the ansj version (the old version can no longer be found), a separate dependency on tree_split is now required, which, surprisingly, is not referenced by ansj itself:

        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>1.4</version>
            <classifier>min</classifier>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>tree_split</artifactId>
            <version>1.2</version>
            <scope>compile</scope>
        </dependency>

isRealName does not preserve the original case of English words

Hello, the analyzer is configured with isRealName=true, but it still does not keep the original word form and converts everything to lowercase. The bug is in the AnsjElasticConfigurator class:
[screenshot]

The two pieces of code marked in red need to be swapped. Configuration file:
ansj:
  # default parameters
  isNameRecognition: true # enable person-name recognition
  isNumRecognition: true # enable number recognition
  isQuantifierRecognition: true # merge numbers and quantifiers
  isRealName: true # keep the original (real) form of words; keeping false is recommended

Hope this gets fixed, thanks! ^_^

Cannot use index_ansj in a custom analyzer

Elasticsearch version: 2.3.3

elasticsearch.yml is configured as follows:

index :
  analysis :
    tokenizer :
      index_ansj :
        type : index_ansj
    filter :
      ini_synonym :
        type : synonym
        synonyms_path: ansj/dic/synonym.txt
    analyzer :
      custom1 :
        tokenizer : index_ansj
        filter : [ini_synonym]


index.analysis.analyzer.default.type: custom1

The following log is printed at startup:

[.kibana] IndexCreationException[failed to create index]; nested: IllegalArgumentException[Unknown Analyzer type [custom1] for [default]];
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:362)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:294)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:163)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Unknown Analyzer type [custom1] for [default]
        at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:320)
        at org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
        at org.elasticsearch.common.inject.spi.Elements$RecordingBinder.install(Elements.java:233)
        at org.elasticsearch.common.inject.spi.Elements.getElements(Elements.java:105)
        at org.elasticsearch.common.inject.InjectorShell$Builder.build(InjectorShell.java:143)
        at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:99)
        at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:157)
        at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:358)
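
Not an official fix, but the error suggests that index.analysis.analyzer.default.type must name an analyzer type, not a custom analyzer defined elsewhere in the settings. A hedged workaround on 2.x is to define the default analyzer itself on top of the ansj tokenizer, reusing the filter declared above, for example:

index:
  analysis:
    analyzer:
      default:
        tokenizer: index_ansj
        filter: [ini_synonym]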

Segmentation confusion for [邓超过生日玩动漫游戏]

As shown in the screenshots...
1. The 邓超过 problem (figure 1):
The ambiguity dictionary contains 「邓超 n 过 v 生日 n」, yet search_ansj still segments it as 「邓 超过」.
[screenshot]

2. The 动漫游 problem (figure 2):
The dictionary contains 「动漫」「游戏」「动漫游」「漫游」「动漫游戏」, and the ambiguity dictionary also contains 「动漫 n 游戏 n」.
Why does index_ansj produce 「动漫」 and 「动漫游」 but not 「游戏」 or 「漫游」 (even though 漫游 should not be produced)?
[screenshot]

Publishing and deleting ambiguity entries via Redis

1. Running the Redis command publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v
adds one line to the ambiguity.dic dictionary:
减肥瘦身 减肥 nr 瘦身 v
However, ansj_seg expects records in this dictionary to look like:
李民 nr 工作 vn
三个 m 和尚 n
的确 d 定 v 不 v
大 a 和尚 n
that is, an even number of fields per record.
[screenshot]
Because of this, the following message appears in the log when Elasticsearch starts:
[screenshot]

2. Running the Redis command publish ansj_term a:d:减肥瘦身 wipes out the entire contents of ambiguity.dic.
[screenshot]
Deletion works with the approach from the earlier 2.3.4 version.
Follow-up: fix the memory leak caused by IO exceptions.

Stop-word problem with the ansj Elasticsearch plugin

The log shows that the stop-word dictionary was loaded successfully, but stop words still appear in the segmentation results. Help!
I have tried one word per line, as well as the following format:

与    p    1000
专业    n    1000

i.e. both stop-word dictionary formats.
Elasticsearch version 2.3.3.

Confusion about index_ansj segmentation

When segmenting with index_ansj, single characters are also produced. Using the author's example 六味地黄丸软胶囊, the result contains 六、味、地、黄、丸、软、胶、囊, which does not match the author's description.

Advanced configuration reports that no Redis configuration was found

Log:

[2014-09-25 22:15:09,251][INFO ][cluster.service          ] [Atalon] new_master [Atalon][oeOrI-8rTuSh9AQiADXVpw][db01.mst365.cn][inet[/10.171.229.120:9300]], reason: zen-disco-join (elected_as_master)
[2014-09-25 22:15:09,281][INFO ][http                     ] [Atalon] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/10.171.229.120:9200]}
[2014-09-25 22:15:09,282][INFO ][node                     ] [Atalon] started
[2014-09-25 22:15:11,947][INFO ][ansj-analyzer            ] ansj分词器预热完毕,可以使用!
[2014-09-25 22:15:11,947][INFO ][ansj-analyzer            ] 没有找到redis相关配置!
[2014-09-25 22:15:12,417][INFO ][gateway                  ] [Atalon] recovered [1] indices into cluster_state
[2014-09-25 22:23:57,245][INFO ][node                     ] [Atalon] stopping ...
[2014-09-25 22:23:57,279][INFO ][node                     ] [Atalon] stopped
[2014-09-25 22:23:57,280][INFO ][node                     ] [Atalon] closing ...
[2014-09-25 22:23:57,288][INFO ][node                     ] [Atalon] closed

Could you provide a raw configuration file? Copying the configuration straight from the README easily leads to formatting problems!

Also, YAML is normally indented with two spaces.

Is there a Redis reconnection feature?

When the Redis server fails, does the plugin reconnect automatically? If so, how is it configured?

Redis cannot be found

ip: master.redis.yao.com:6379
With this configuration, Redis cannot be found.

Segmentation confusion

里皮带国家队 is segmented as 里/皮带/国家队.

How can this be fixed?

Custom path for the ambiguity dictionary does not take effect; Redis publishing reports errors

The configuration file is:

ambiguity_path: "/<absolute path>/config/ansj/dic/ambiguity.dic"

Publishing via redis-cli:

publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v

The error log is:

[2016-09-19 09:13:06,710][ERROR][ansj-redis-msg-file      ] appendAMB exception
java.security.PrivilegedActionException: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendFile(FileUtils.java:71)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendAMB(FileUtils.java:58)
        at org.ansj.elasticsearch.pubsub.redis.AddTermRedisPubSub.onMessage(AddTermRedisPubSub.java:38)
        at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:113)
        at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:83)
        at redis.clients.jedis.Jedis.subscribe(Jedis.java:1974)
        at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:93)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileWriter.<init>(FileWriter.java:107)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils$1.run(FileUtils.java:74)
        ... 9 more
java.security.PrivilegedActionException: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendFile(FileUtils.java:71)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendAMB(FileUtils.java:58)
        at org.ansj.elasticsearch.pubsub.redis.AddTermRedisPubSub.onMessage(AddTermRedisPubSub.java:38)
        at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:113)
        at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:83)
        at redis.clients.jedis.Jedis.subscribe(Jedis.java:1974)
        at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:93)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileWriter.<init>(FileWriter.java:107)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils$1.run(FileUtils.java:74)
        ... 9 more

Tracing into FileUtils.java afterwards revealed:
[screenshot]
I suspect the relative path was lost during a code upgrade, which causes this error.

Elasticsearch cannot load ansj after packaging the project

Elasticsearch 1.5.1.
The zip downloaded from our private Maven repository runs fine, but the zip built with mvn package fails with java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.ansjindex.AnsjIndexAnalyzerProvider. The path looks strange and I could not find the cause.
Could you please take a look? Many thanks!

The full stack trace is below:
org.elasticsearch.indices.IndexCreationException: [group_20160108_172451] failed to create index
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:330)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:311)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:180)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:467)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: failed to find analyzer type [ansj_index] or tokenizer for [index_ansj]
    at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:372)
    at org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
    at org.elasticsearch.common.inject.spi.Elements$RecordingBinder.install(Elements.java:204)
    at org.elasticsearch.common.inject.spi.Elements.getElements(Elements.java:85)
    at org.elasticsearch.common.inject.InjectorShell$Builder.build(InjectorShell.java:130)
    at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:99)
    at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
    at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:328)
    ... 8 more
Caused by: org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class setting [type] with value [ansj_index]
    at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:476)
    at org.elasticsearch.common.settings.ImmutableSettings.getAsClass(ImmutableSettings.java:464)
    at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:356)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.ansjindex.AnsjIndexAnalyzerProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:474)
    ... 18 more

How can the part of speech be shown in the analyze result?

Hello, currently the analyze result returns:

}, {
"token" : "大爷",
"start_offset" : 55,
"end_offset" : 57,
"type" : "word",
"position" : 32
}, {
"token" : "的",
"start_offset" : 57,
"end_offset" : 58,
"type" : "word",
"position" : 33
} ]

How can the type field, or an additional field, be made to carry the part-of-speech information in the response?
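
The standard _analyze response does not carry part-of-speech tags, but as shown in the Testing section above, the plugin's own /_cat/ansj endpoint returns a nature field for every token, which can serve as a workaround (the sample text below simply reuses tokens from the output above):

GET /_cat/ansj?text=大爷的&type=index_ansj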

Not usable on ES 1.7.2 after installing with the 1.x instructions

After installing on ES 1.7.2 following the 1.x instructions, the plugin cannot be used. Running:

curl '127.0.0.1:9200/_analyze?analyzer=query_ansj' -d '中华人民共和国'

produces the following error:

{"error":"ElasticsearchIllegalArgumentException[failed to find analyzer [query_ansj]]","status":400}

Switching to search_ansj or index_ansj gives the same error.

Ambiguity dictionary problem

How can I prevent index_ansj from splitting up words that are in the ambiguity dictionary again? The ambiguity dictionary contains 不确定, yet the segmentation result contains both 不确定 and 确定... I have already set enable_skip_user_define: false.

What are the format rules?

For example:

publish ansj_term u:c:视康
publish ansj_term u:d:视康
publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v

u:c
u:d
...
What do these prefixes mean?
