
elasticsearch-analysis-ansj's Introduction

elasticsearch-analysis-ansj: a Chinese analysis plugin for Elasticsearch

Overview

elasticsearch-analysis-ansj is a Chinese analysis plugin for Elasticsearch based on the ansj word segmentation algorithm.

Build

mvn package

After a successful build, the packaged plugin archive is generated at target/releases/elasticsearch-analysis-ansj-<version>-release.zip.

Installation

Install command

Run the following command from the Elasticsearch installation directory to install the plugin:

./bin/elasticsearch-plugin install file:///<your-path>/elasticsearch-analysis-ansj-<version>-release.zip

After installation, a default configuration file is generated at <ES_HOME>/config/elasticsearch-analysis-ansj/ansj.cfg.yml; edit it as needed.

Testing

After installation, start the Elasticsearch cluster. Verify that the plugin is installed correctly in either of the following ways:
Method 1:
Run GET /_cat/ansj?text=**&type=index_ansj in Kibana to test the index_ansj analyzer. The response looks like this:

{
  "result": [
    {
      "name": "**",
      "nature": "ns",
      "offe": 0,
      "realName": "**",
      "synonyms": null
    },
    {
      "name": "",
      "nature": "f",
      "offe": 0,
      "realName": "",
      "synonyms": null
    },
    {
      "name": "",
      "nature": "n",
      "offe": 1,
      "realName": "",
      "synonyms": null
    }
  ]
}
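
The same endpoint can be pointed at the other analyzers by changing the type parameter. For example (the sample text below is our own, and we assume type accepts any of the three analyzer names):

GET /_cat/ansj?text=中华人民共和国&type=query_ansj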

Method 2:
Run GET /_cat/ansj/config in Kibana to retrieve the configuration. The response looks like this:

{
  "ambiguity": [
    "ambiguity"
  ],
  "stop": [
    "stop"
  ],
  "synonyms": [
    "synonyms"
  ],
  "crf": [
    "crf"
  ],
  "isQuantifierRecognition": "true",
  "isRealName": "false",
  "isNumRecognition": "true",
  "isNameRecognition": "true",
  "dic": [
    "dic"
  ]
}
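
If Kibana is not available, the same checks can be run with curl against any node's HTTP port (localhost:9200 is an assumption; substitute your own host and port, and your own sample text):

curl -G 'http://localhost:9200/_cat/ansj' --data-urlencode 'text=中华人民共和国' --data-urlencode 'type=index_ansj'
curl -XGET 'http://localhost:9200/_cat/ansj/config'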

Usage

  • Step 1: create an index
PUT /test_index?pretty
{
  "settings" : {
    "index" : {
      "number_of_shards" : 16,
      "number_of_replicas" : 1,
      "refresh_interval":"5s"
    }
  },
  "mappings" : {
    "properties" : {
      "test_field": { 
        "type": "text",
        "analyzer": "index_ansj",
        "search_analyzer": "query_ansj"
      }
    }
  }
}

Notes:

  • test_index: the name of the index used for testing;
  • test_field: the field used for testing;
  • the field's index-time analyzer is set to index_ansj;
  • the field's search-time analyzer is set to query_ansj.

Verify that the index is configured correctly (an alternative form using an explicit analyzer name is shown after step 3):

POST /test_index/_analyze
{
  "field": "test_field",
  "text": "**"
}
  • Step 2: add data
PUT test_index/_bulk?refresh
{"create":{ }}
{ "test_field" : "**" }
{"create":{ }}
{ "test_field" : "中华人民共和国" }
{"create":{ }}
{ "test_field" : "**有56个民族" }
{"create":{ }}
{ "test_field" : "**是社会主义国家" }
  • Step 3: run a search
GET test_index/_search
{
  "query": {
    "match": {
      "test_field": {
        "query": "**"
      }
    }
  }
}
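
To compare how the index-time and search-time analyzers treat the same text, _analyze can also be called with an explicit analyzer name instead of a field (the sample text below is our own):

POST /test_index/_analyze
{
  "analyzer": "index_ansj",
  "text": "中华人民共和国"
}

POST /test_index/_analyze
{
  "analyzer": "query_ansj",
  "text": "中华人民共和国"
}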

Note:

  • the requests above are executed in Kibana Dev Tools;
  • they have only been tested against Elasticsearch 8.x; adjust them as needed for other versions.

Plugin features

After installing the plugin, the following components become available in the Elasticsearch cluster:

Three analyzers:

  • index_ansj (recommended for indexing)
  • query_ansj (recommended for search)
  • dic_ansj

Three tokenizers:

  • index_ansj (recommended for indexing)
  • query_ansj (recommended for search)
  • dic_ansj
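
Because the tokenizers are registered as well, they can be combined with standard token filters in a custom analyzer defined in index settings. A minimal sketch (the index name, analyzer name and filter choice are our own):

PUT /my_ansj_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ansj": {
          "type": "custom",
          "tokenizer": "index_ansj",
          "filter": ["lowercase"]
        }
      }
    }
  }
}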

HTTP endpoints:

  • /_cat/ansj: analyze a piece of text
  • /_cat/ansj/config: show the full configuration
  • /_ansj/flush/config: reload the full configuration
  • /_ansj/flush/dic: reload all dictionaries, including the user dictionary, stop-word dictionary, synonym dictionary, ambiguity dictionary, and CRF model
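
A short Kibana Dev Tools session exercising these endpoints might look like the following. The README only lists the paths, so the HTTP verbs for the two flush endpoints are an assumption; adjust them if your version expects a different method:

GET /_cat/ansj?text=**&type=index_ansj
GET /_cat/ansj/config
# reload the configuration (verb assumed)
GET /_ansj/flush/config
# reload all dictionaries (verb assumed)
GET /_ansj/flush/dic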

Configuration file

Configuration file format

ansj:
  # default parameters
  isNameRecognition: true # enable person-name recognition
  isNumRecognition: true # enable number recognition
  isQuantifierRecognition: true # merge numbers and quantifiers
  isRealName: false # keep the original (real) form of words; keeping this false is recommended

  # user dictionary configuration
  #dic: default.dic # can also be written as file://default.dic; if dic is not configured, this dictionary is loaded by default
  # load over HTTP
  #dic_d1: http://xxx/xx.dic
  # load from a file inside a jar
  #dic_d2: jar://org.ansj.dic.DicReader|/dic2.dic
  # load from a database
  #dic_d3: jdbc://jdbc:mysql://xxxx:3306/ttt?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull|username|password|select name as name,nature,freq from dic where type=1
  # load from a custom class: YourClass extends PathToStream
  #dic_d4: class://xxx.xxx.YourClass|otherparam

  # stop-word (filter) dictionary configuration
  #stop: ... # http, file, jar, class and jdbc sources are all supported
  #stop_key1: ...

  # ambiguity dictionary configuration
  #ambiguity: ... # http, file, jar, class and jdbc sources are all supported
  #ambiguity_key1: ...

  # synonym dictionary configuration
  #synonyms: ... # http, file, jar, class and jdbc sources are all supported
  #synonyms_key1: ...
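
The on-disk dictionary format is not spelled out in the README. Based on the jdbc example above (name, nature, freq) and ansj's conventions, a user dictionary entry is one word per line, optionally followed by a part-of-speech tag and a frequency, separated by tabs. Treat the lines below as an illustrative sketch rather than an authoritative specification:

六味地黄丸	n	1000
动漫游戏	n	1000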

Configuration examples

Loading dictionaries from local files

ansj:
  # enable person-name recognition
  isNameRecognition: false
  # enable number recognition
  isNumRecognition: true
  # merge numbers and quantifiers
  isQuantifierRecognition: false
  # keep the original (real) form of words
  isRealName: false
  # user dictionary
  dic: file:///data/elasticsearch-dic/ansj/main.dic
  # stop-word (filter) dictionary
  stop: file:///data/elasticsearch-dic/ansj/stop.dic
  # ambiguity dictionary
  ambiguity: file:///data/elasticsearch-dic/ansj/ambiguity.dic
  # synonym dictionary
  synonyms: file:///data/elasticsearch-dic/ansj/synonyms.dic

Loading dictionaries over HTTP

ansj:
  # enable person-name recognition
  isNameRecognition: false
  # enable number recognition
  isNumRecognition: true
  # merge numbers and quantifiers
  isQuantifierRecognition: false
  # keep the original (real) form of words
  isRealName: false
  # user dictionary
  dic: http://example.com/elasticsearch-dic/ansj/main.dic
  # stop-word (filter) dictionary
  stop: http://example.com/elasticsearch-dic/ansj/stop.dic
  # ambiguity dictionary
  ambiguity: http://example.com/elasticsearch-dic/ansj/ambiguity.dic
  # synonym dictionary
  synonyms: http://example.com/elasticsearch-dic/ansj/synonyms.dic
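
Sources can also be mixed in a single configuration, since each key independently accepts any of the supported schemes; for example, a local main dictionary combined with stop words pulled over HTTP (the path and URL below are placeholders):

ansj:
  dic: file:///data/elasticsearch-dic/ansj/main.dic
  stop: http://example.com/elasticsearch-dic/ansj/stop.dic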

Plugin versions and corresponding Elasticsearch versions

Plugin version    Elasticsearch version
1.0.0 0.90.2
1.x 1.x
2.1.1 2.1.1
2.3.1 2.3.1
2.3.2 2.3.2
2.3.3 2.3.3
2.3.4 2.3.4
2.3.5 2.3.5
2.4.0 2.4.0
2.4.1 2.4.1
2.4.2 2.4.2
2.4.3 2.4.3
2.4.4 2.4.4
2.4.5 2.4.5
2.4.6 2.4.6
5.0.0 5.0.0
5.0.1 5.0.1
5.0.2 5.0.2
5.1.1 5.1.1
5.1.2 5.1.2
5.2.0 5.2.0
5.2.1 5.2.1
5.2.2 5.2.2
5.3.0 5.3.0
5.3.1 5.3.1
5.3.2 5.3.2
5.3.3 5.3.3
5.4.0 5.4.0
5.4.1 5.4.1
5.4.2 5.4.2
5.4.3 5.4.3
5.5.0 5.5.0
5.5.1 5.5.1
5.5.2 5.5.2
5.5.3 5.5.3
5.6.0 5.6.0
5.6.1 5.6.1
5.6.2 5.6.2
5.6.3 5.6.3
5.6.4 5.6.4
5.6.5 5.6.5
5.6.6 5.6.6
5.6.7 5.6.7
5.6.8 5.6.8
5.6.9 5.6.9
5.6.10 5.6.10
5.6.11 5.6.11
5.6.12 5.6.12
5.6.13 5.6.13
5.6.14 5.6.14
5.6.15 5.6.15
5.6.16 5.6.16
6.0.0 6.0.0
6.0.1 6.0.1
6.1.0 6.1.0
6.1.1 6.1.1
6.1.2 6.1.2
6.1.3 6.1.3
6.1.4 6.1.4
6.2.0 6.2.0
6.2.1 6.2.1
6.2.2 6.2.2
6.2.3 6.2.3
6.2.4 6.2.4
6.3.0 6.3.0
6.3.1 6.3.1
6.3.2 6.3.2
6.4.0 6.4.0
6.4.1 6.4.1
6.4.2 6.4.2
6.4.3 6.4.3
6.5.0 6.5.0
6.5.1 6.5.1
6.5.2 6.5.2
6.5.3 6.5.3
6.5.4 6.5.4
6.6.0 6.6.0
6.6.1 6.6.1
6.6.2 6.6.2
6.7.0 6.7.0
6.7.1 6.7.1
6.7.2 6.7.2
6.8.0 6.8.0
6.8.1 6.8.1
6.8.2 6.8.2
6.8.3 6.8.3
6.8.4 6.8.4
6.8.5 6.8.5
6.8.6 6.8.6
6.8.7 6.8.7
6.8.8 6.8.8
6.8.9 6.8.9
6.8.10 6.8.10
6.8.11 6.8.11
6.8.12 6.8.12
6.8.13 6.8.13
6.8.14 6.8.14
6.8.15 6.8.15
6.8.16 6.8.16
6.8.17 6.8.17
6.8.18 6.8.18
6.8.19 6.8.19
6.8.20 6.8.20
6.8.21 6.8.21
6.8.22 6.8.22
6.8.23 6.8.23
7.0.0 7.0.0
7.0.1 7.0.1
7.1.0 7.1.0
7.1.1 7.1.1
7.2.0 7.2.0
7.2.1 7.2.1
7.3.0 7.3.0
7.3.1 7.3.1
7.3.2 7.3.2
7.4.0 7.4.0
7.4.1 7.4.1
7.4.2 7.4.2
7.5.0 7.5.0
7.5.1 7.5.1
7.5.2 7.5.2
7.6.0 7.6.0
7.6.1 7.6.1
7.6.2 7.6.2
7.7.0 7.7.0
7.7.1 7.7.1
7.8.0 7.8.0
7.8.1 7.8.1
7.9.0 7.9.0
7.9.1 7.9.1
7.9.2 7.9.2
7.9.3 7.9.3
7.17.5 7.17.5
7.17.7 7.17.7
7.17.8 7.17.8
7.17.9 7.17.9
7.17.10 7.17.10
7.17.11 7.17.11
7.17.12 7.17.12
7.17.13 7.17.13
7.17.14 7.17.14
7.17.15 7.17.15
7.17.16 7.17.16
8.3.3 8.3.3
8.5.3 8.5.3
8.6.0 8.6.0
8.6.1 8.6.1
8.6.2 8.6.2
8.7.0 8.7.0
8.7.1 8.7.1
8.8.0 8.8.0
8.8.1 8.8.1
8.8.2 8.8.2
8.9.0 8.9.0
8.9.1 8.9.1
8.9.2 8.9.2
8.10.0 8.10.0
8.10.1 8.10.1
8.10.2 8.10.2
8.10.3 8.10.3
8.10.4 8.10.4
8.11.0 8.11.0
8.11.1 8.11.1
8.11.2 8.11.2
8.11.3 8.11.3

License

elasticsearch-analysis-ansj is licensed under the Apache License Version 2.0. See the LICENSE file for details.

elasticsearch-analysis-ansj's People

Contributors

4eversm, ansjsun, clyuz, defp, dependabot[bot], fossabot, hanbj, shi-yuan, timzaak, yyhan


elasticsearch-analysis-ansj's Issues

Integrating ansj with Elasticsearch 2.1.1

After installing ansj on Elasticsearch 2.1.1, why does highlighting still show the text split into single characters? And why is there no stop-word dictionary?

Configuration file for updating dictionaries via Redis

Does the ansj dictionary configuration need to be written into elasticsearch.yml? I added it there, but it still reports that no Redis configuration was found:
[2016-09-14 18:16:24,400][INFO ][ansj-initializer ] 没有找到redis相关配置!

Publishing terms via Redis no longer works

Elasticsearch version: 2.3.3
Plugin version: 2.3.3.2

Starting after adding -Des.security.manager.enabled=false produces the following error:

redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:22)
    at org.ansj.elasticsearch.pubsub.redis.RedisUtils.getConnection(RedisUtils.java:22)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:85)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Could not create a validated object, cause: ValidateObject failed
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1203)
    at redis.clients.util.Pool.getResource(Pool.java:20)
    ... 3 more
[2016-07-02 21:23:23,838][ERROR][ansj-redis-utils         ] Could not get a resource from the pool
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:22)
    at org.ansj.elasticsearch.pubsub.redis.RedisUtils.getConnection(RedisUtils.java:22)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:85)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Could not create a validated object, cause: ValidateObject failed
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1203)
    at redis.clients.util.Pool.getResource(Pool.java:20)
    ... 3 more
[2016-07-02 21:23:23,839][INFO ][ansj-initializer         ] redis守护线程准备完毕,ip:127.0.0.1:6379,port:6379,channel:ansj_term
Exception in thread "Thread-3" java.lang.NullPointerException
    at java.util.Objects.requireNonNull(Objects.java:203)
    at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:89)
    at java.lang.Thread.run(Thread.java:745)

Publishing through the channel produces no ext.dic file, and the added terms have no effect.

MD5 Hash字符串被分词

Elasticsearch 1.7.2
ansj 插件 1.x 版本

类型 string,未指定 not_analyze, 值 fd04fe9b5225461204e75837f1616575, 分词器 index_ansj

该值被分词结果如下:

{'tokens': [{'end_offset': 2,
'position': 1,
'start_offset': 0,
'token': 'fd',
'type': 'word'},
{'end_offset': 4,
'position': 2,
'start_offset': 2,
'token': '04',
'type': 'word'},
{'end_offset': 6,
'position': 3,
'start_offset': 4,
'token': 'fe',
'type': 'word'},
{'end_offset': 7,
'position': 4,
'start_offset': 6,
'token': '9',
'type': 'word'},
{'end_offset': 8,
'position': 5,
'start_offset': 7,
'token': 'b',
'type': 'word'},
{'end_offset': 18,
'position': 6,
'start_offset': 8,
'token': '5225461204',
'type': 'word'},
{'end_offset': 19,
'position': 7,
'start_offset': 18,
'token': 'e',
'type': 'word'},
{'end_offset': 24,
'position': 8,
'start_offset': 19,
'token': '75837',
'type': 'word'},
{'end_offset': 25,
'position': 9,
'start_offset': 24,
'token': 'f',
'type': 'word'},
{'end_offset': 32,
'position': 10,
'start_offset': 25,
'token': '1616575',
'type': 'word'}]}

What is the rationale behind this tokenization? Thanks!
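
This is not an answer from the maintainers, just a common workaround: identifiers such as MD5 hashes are usually kept intact by exempting the field from analysis entirely. On the 1.x/2.x mapping API that looks roughly like the sketch below (index, type and field names are placeholders); on Elasticsearch 5+ the equivalent is the keyword field type.

PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "md5": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}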

Error after changing plugin-descriptor to 5.3.0

  1. Could not find a suitable constructor in org.elasticsearch.rest.RestController. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
    at org.elasticsearch.rest.RestController.class(Unknown Source)
    while locating org.elasticsearch.rest.RestController
    for parameter 1 at org.ansj.elasticsearch.cat.AnsjCatAction.(Unknown Source)
    at unknown

2 errors
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:361) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.InjectorBuilder.initializeStatically(InjectorBuilder.java:137) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:93) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:43) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.node.Node.(Node.java:482) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.node.Node.(Node.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap$6.(Bootstrap.java:242) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:360) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.3.0.jar:5.3.0]
... 6 more

A segmentation problem

After adding some terms, a few problems appeared:

Segmenting 三打白骨精 with index_ansj:
[screenshot]
It does not segment at the finest granularity.

The number of results from forced segmentation also seems to be capped, which easily leads to array-index-out-of-bounds errors.

Can stop words be disabled for a specific index?

For example, with a configuration like the following?

PUT /my_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "default":{
            	"type": "search_ansj",
            	"enabled_stop_filter": false
            }
         }
      }
   }
}

Questions about using part-of-speech tags

  1. During index analysis, ansj segments the text and produces part-of-speech information. I would like to filter out certain parts of speech (just like filtering stop words). How can this be done?
  2. At query time, after the query is segmented each term carries a part-of-speech tag. Different parts of speech may deserve different term weights when ranking, so the part-of-speech information is needed during scoring. How can this be done?
    Thanks!

index_ansj works fine as an analyzer, but using it as a standalone tokenizer reports that the tokenizer does not exist.

curl -XGET 'localhost:9200/_analyze?tokenizer=index_ansj&filter=pinyin&pretty' -d '你好,我是小明的同学小强'
{
"error" : {
"root_cause" : [ {
"type" : "remote_transport_exception",
"reason" : "[node1][127.0.0.1:9300][indices:admin/analyze[s]]"
} ],
"type" : "null_pointer_exception",
"reason" : null
},
"status" : 500
}

Analysis with the analyzer works fine:
curl -XGET 'localhost:9200/_analyze?analyzer=index_ansj&pretty' -d '你好,我是小明的同学小强'
{
"tokens" : [ {
"token" : "你好",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : ",",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "我",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "是",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 3
}, {
"token" : "小明",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "的",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 5
}, {
"token" : "同学",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 6
}, {
"token" : "小强",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 7
} ]
}
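
Not from this thread, but possibly relevant: on Elasticsearch 5.x and later the tokenizer can be passed in the request body rather than as a URL parameter, which may behave differently from the URL form used above. A sketch (without the pinyin filter, which requires its own plugin):

GET /_analyze
{
  "tokenizer": "index_ansj",
  "text": "你好,我是小明的同学小强"
}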

Why are positions strictly increasing instead of corresponding one-to-one with the original text?

In the examples below, the left side is the input and the right side is the indexed terms. I would like every group of terms to share positions 0, 1, 2 so that a phrase search can use any combination, e.g. the terms at position 1 are [liu, l], position 2: [de, d], position 3: [hua, h].

  • 刘德华,liudehua=>[liu, de, hua]
  • ldh =>[l,d,h]
  • 刘德h=>[liu,de,h]
  • l德华=>[l,de,hua]
  • liudh => ['liu','d','h']

At the same time, I would like the indexed result to look like this:

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset":1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
]
}

elasticsearch-analysis-ansj counts spaces as tokens

http://localhost:9200/docs/_analyze?text=ce%20shi&analyzer=index_ansj
Testing shows that a space is also counted as a token; how can it be removed?
{"tokens":[{"token":"ce","start_offset":0,"end_offset":2,"type":"en","position":0},{"token":" ","start_offset":2,"end_offset":3,"type":"null","position":1},{"token":"shi","start_offset":3,"end_offset":6,"type":"en","position":2}]}

A few small configuration questions

If detailed documentation already exists, please point me to it; I have searched for a long time without finding any.

  1. Is there a reference for the commands used to push terms remotely via Redis?

  2. I want to use it to manage the default dictionary, the stop-word dictionary and so on; how should this be configured?

  3. How is the synonym dictionary configured?

  4. ansj keeps punctuation and even spaces, which is useful for terms such as .Net, C++ and C#, but how should the many useless symbols be handled? Should they be added to the stop-word dictionary as stop words?

  5. Apart from the ambiguity.dic found at http://maven.nlpcn.org/down/library/, I could not find examples of the other dictionaries. Could you tell me where to look?

Garbled highlighting

I am using version 2.3.4. Chinese articles containing various punctuation marks are highlighted with the fvh highlighter, but the highlights are always garbled and unrelated to the input keywords. The token positions in the index appear to be out of order; filtering out the punctuation did not help.

Single-character search problem

Hello, and thank you for developing this analysis plugin. I ran into a small problem while using it and would like to ask about it.
For example, I have two documents:
1. 雪野新村
2. 上南新村
Searching with a single character as the keyword returns nothing, yet searching with 雪野 finds document 1.
Searching with another single character finds document 2.
How are single-character searches handled? I do not quite understand the results above; is this related to segmentation?

Error while indexing on Elasticsearch 2.3.1

Creating the index works fine, but an error occurs as soon as data is indexed:

[2016-04-23 17:45:27,284][WARN ][rest.suppressed          ] /products/product/13 Params: {index=products, id=13, type=product}
RemoteTransportException[[Lightbright][127.0.0.1:9300][indices:data/write/index[p]]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at org.ansj.util.AnsjReader.<init>(AnsjReader.java:34)
    at org.ansj.util.AnsjReader.<init>(AnsjReader.java:49)
    at org.ansj.splitWord.analysis.IndexAnalysis.<init>(IndexAnalysis.java:133)
    at org.ansj.lucene5.AnsjAnalyzer.getTokenizer(AnsjAnalyzer.java:95)
    at org.ansj.elasticsearch.index.analysis.AnsjAnalysis$1.create(AnsjAnalysis.java:59)
    at org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(CustomAnalyzer.java:83)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
    at org.apache.lucene.document.Field.tokenStream(Field.java:562)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:530)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

My configuration file is as follows:

index:
  analysis:
    analyzer:
      customer_ansj_index:
        tokenizer: index_ansj
      customer_ansj_query:
        tokenizer: query_ansj

mvn clean install fails after cloning the project

As the title says, running mvn clean install after cloning the project fails with:
[ERROR] Failed to execute goal on project elasticsearch-analysis-ansj: Could not resolve dependencies for project org.ansj:elasticsearch-analysis-ansj:jar:2.1.1: The following artifacts could not be resolved: org.ansj:ansj_seg:jar:3.6, org.ansj:ansj_lucene5_plug:jar:3.0: Failure to find org.ansj:ansj_seg:jar:3.6 in http://repo1.maven.org/maven2/ was cached in the local repository, resolution will not be reattempted until the update interval of repo1 has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Add a dependency

After upgrading the ansj version (the old version can no longer be found), a separate dependency on tree_split is now required, which, surprisingly, is not referenced by ansj itself:

        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>1.4</version>
            <classifier>min</classifier>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>tree_split</artifactId>
            <version>1.2</version>
            <scope>compile</scope>
        </dependency>

isRealName does not preserve the original case of English words

Hello, the analyzer is configured with isRealName=true, but it still does not keep the original word form and converts everything to lowercase. The bug is in the AnsjElasticConfigurator class:
[screenshot]

The two pieces of code marked in red need to be swapped. Configuration file:
ansj:
  # default parameters
  isNameRecognition: true # enable person-name recognition
  isNumRecognition: true # enable number recognition
  isQuantifierRecognition: true # merge numbers and quantifiers
  isRealName: true # keep the original (real) form of words; keeping false is recommended

Hope this gets fixed, thanks! ^_^

Cannot use index_ansj in a custom analyzer

Elasticsearch version: 2.3.3

elasticsearch.yml is configured as follows:

index :
  analysis :
    tokenizer :
      index_ansj :
        type : index_ansj
    filter :
      ini_synonym :
        type : synonym
        synonyms_path: ansj/dic/synonym.txt
    analyzer :
      custom1 :
        tokenizer : index_ansj
        filter : [ini_synonym]


index.analysis.analyzer.default.type: custom1

The following log is printed at startup:

[.kibana] IndexCreationException[failed to create index]; nested: IllegalArgumentException[Unknown Analyzer type [custom1] for [default]];
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:362)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:294)
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:163)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Unknown Analyzer type [custom1] for [default]
        at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:320)
        at org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
        at org.elasticsearch.common.inject.spi.Elements$RecordingBinder.install(Elements.java:233)
        at org.elasticsearch.common.inject.spi.Elements.getElements(Elements.java:105)
        at org.elasticsearch.common.inject.InjectorShell$Builder.build(InjectorShell.java:143)
        at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:99)
        at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:157)
        at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:358)
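
Not an official fix, but the error suggests that index.analysis.analyzer.default.type must name an analyzer type, not a custom analyzer defined elsewhere in the settings. A hedged workaround on 2.x is to define the default analyzer itself on top of the ansj tokenizer, reusing the filter declared above, for example:

index:
  analysis:
    analyzer:
      default:
        tokenizer: index_ansj
        filter: [ini_synonym]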

Segmentation confusion for [邓超过生日玩动漫游戏]

As shown in the screenshots...
1. The 邓超过 problem (figure 1):
The ambiguity dictionary contains 「邓超 n 过 v 生日 n」, yet search_ansj still segments it as 「邓 超过」.
[screenshot]

2. The 动漫游 problem (figure 2):
The dictionary contains 「动漫」「游戏」「动漫游」「漫游」「动漫游戏」, and the ambiguity dictionary also contains 「动漫 n 游戏 n」.
Why does index_ansj produce 「动漫」 and 「动漫游」 but not 「游戏」 or 「漫游」 (even though 漫游 should not be produced)?
[screenshot]

Publishing and deleting ambiguity entries via Redis

1. Running the Redis command publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v
adds one line to the ambiguity.dic dictionary:
减肥瘦身 减肥 nr 瘦身 v
However, ansj_seg expects records in this dictionary to look like:
李民 nr 工作 vn
三个 m 和尚 n
的确 d 定 v 不 v
大 a 和尚 n
that is, an even number of fields per record.
[screenshot]
Because of this, the following message appears in the log when Elasticsearch starts:
[screenshot]

2. Running the Redis command publish ansj_term a:d:减肥瘦身 wipes out the entire contents of ambiguity.dic.
[screenshot]
Deletion works with the approach from the earlier 2.3.4 version.
Follow-up: fix the memory leak caused by IO exceptions.

Stop-word problem with the ansj Elasticsearch plugin

The log shows that the stop-word dictionary was loaded successfully, but stop words still appear in the segmentation results. Help!
I have tried one word per line, as well as the following format:

与    p    1000
专业    n    1000

i.e. both stop-word dictionary formats.
Elasticsearch version 2.3.3.

Confusion about index_ansj segmentation

When segmenting with index_ansj, single characters are also produced. Using the author's example 六味地黄丸软胶囊, the result contains 六、味、地、黄、丸、软、胶、囊, which does not match the author's description.

Advanced configuration reports that no Redis configuration was found

Log:

[2014-09-25 22:15:09,251][INFO ][cluster.service          ] [Atalon] new_master [Atalon][oeOrI-8rTuSh9AQiADXVpw][db01.mst365.cn][inet[/10.171.229.120:9300]], reason: zen-disco-join (elected_as_master)
[2014-09-25 22:15:09,281][INFO ][http                     ] [Atalon] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/10.171.229.120:9200]}
[2014-09-25 22:15:09,282][INFO ][node                     ] [Atalon] started
[2014-09-25 22:15:11,947][INFO ][ansj-analyzer            ] ansj分词器预热完毕,可以使用!
[2014-09-25 22:15:11,947][INFO ][ansj-analyzer            ] 没有找到redis相关配置!
[2014-09-25 22:15:12,417][INFO ][gateway                  ] [Atalon] recovered [1] indices into cluster_state
[2014-09-25 22:23:57,245][INFO ][node                     ] [Atalon] stopping ...
[2014-09-25 22:23:57,279][INFO ][node                     ] [Atalon] stopped
[2014-09-25 22:23:57,280][INFO ][node                     ] [Atalon] closing ...
[2014-09-25 22:23:57,288][INFO ][node                     ] [Atalon] closed

Could you provide a raw configuration file? Copying the configuration straight from the README easily leads to formatting problems!

Also, YAML is normally indented with two spaces.

Is there a Redis reconnection feature?

When the Redis server fails, does the plugin reconnect automatically? If so, how is it configured?

Redis cannot be found

ip: master.redis.yao.com:6379
With this configuration, Redis cannot be found.

Segmentation confusion

里皮带国家队 is segmented as 里/皮带/国家队.

How can this be fixed?

Custom path for the ambiguity dictionary does not take effect; Redis publishing reports errors

The configuration file is:

ambiguity_path: "/<absolute path>/config/ansj/dic/ambiguity.dic"

Publishing via redis-cli:

publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v

The error log is:

[2016-09-19 09:13:06,710][ERROR][ansj-redis-msg-file      ] appendAMB exception
java.security.PrivilegedActionException: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendFile(FileUtils.java:71)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendAMB(FileUtils.java:58)
        at org.ansj.elasticsearch.pubsub.redis.AddTermRedisPubSub.onMessage(AddTermRedisPubSub.java:38)
        at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:113)
        at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:83)
        at redis.clients.jedis.Jedis.subscribe(Jedis.java:1974)
        at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:93)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileWriter.<init>(FileWriter.java:107)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils$1.run(FileUtils.java:74)
        ... 9 more
java.security.PrivilegedActionException: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendFile(FileUtils.java:71)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils.appendAMB(FileUtils.java:58)
        at org.ansj.elasticsearch.pubsub.redis.AddTermRedisPubSub.onMessage(AddTermRedisPubSub.java:38)
        at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:113)
        at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:83)
        at redis.clients.jedis.Jedis.subscribe(Jedis.java:1974)
        at org.ansj.elasticsearch.index.config.AnsjElasticConfigurator$1.run(AnsjElasticConfigurator.java:93)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: ansj/dic/ambiguity.dic (没有那个文件或目录)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileWriter.<init>(FileWriter.java:107)
        at org.ansj.elasticsearch.pubsub.redis.FileUtils$1.run(FileUtils.java:74)
        ... 9 more

Tracing into FileUtils.java afterwards revealed:
[screenshot]
I suspect the relative path was lost during a code upgrade, which causes this error.

Elasticsearch cannot load ansj after packaging the project

Elasticsearch 1.5.1.
The zip downloaded from our private Maven repository runs fine, but the zip built with mvn package fails with java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.ansjindex.AnsjIndexAnalyzerProvider. The path looks strange and I could not find the cause.
Could you please take a look? Many thanks!

The full stack trace is below:
org.elasticsearch.indices.IndexCreationException: [group_20160108_172451] failed to create index
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:330)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:311)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:180)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:467)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: failed to find analyzer type [ansj_index] or tokenizer for [index_ansj]
    at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:372)
    at org.elasticsearch.common.inject.AbstractModule.configure(AbstractModule.java:60)
    at org.elasticsearch.common.inject.spi.Elements$RecordingBinder.install(Elements.java:204)
    at org.elasticsearch.common.inject.spi.Elements.getElements(Elements.java:85)
    at org.elasticsearch.common.inject.InjectorShell$Builder.build(InjectorShell.java:130)
    at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:99)
    at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
    at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:328)
    ... 8 more
Caused by: org.elasticsearch.common.settings.NoClassSettingsException: Failed to load class setting [type] with value [ansj_index]
    at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:476)
    at org.elasticsearch.common.settings.ImmutableSettings.getAsClass(ImmutableSettings.java:464)
    at org.elasticsearch.index.analysis.AnalysisModule.configure(AnalysisModule.java:356)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.index.analysis.ansjindex.AnsjIndexAnalyzerProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.elasticsearch.common.settings.ImmutableSettings.loadClass(ImmutableSettings.java:474)
    ... 18 more

How can the part of speech be shown in the analyze result?

Hello, currently the analyze result returns:

}, {
"token" : "大爷",
"start_offset" : 55,
"end_offset" : 57,
"type" : "word",
"position" : 32
}, {
"token" : "的",
"start_offset" : 57,
"end_offset" : 58,
"type" : "word",
"position" : 33
} ]

How can the type field, or an additional field, be made to carry the part-of-speech information in the response?
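
The standard _analyze response does not carry part-of-speech tags, but as shown in the Testing section above, the plugin's own /_cat/ansj endpoint returns a nature field for every token, which can serve as a workaround (the sample text below simply reuses tokens from the output above):

GET /_cat/ansj?text=大爷的&type=index_ansj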

Not usable on ES 1.7.2 after installing with the 1.x instructions

After installing on ES 1.7.2 following the 1.x instructions, the plugin cannot be used. Running:

curl '127.0.0.1:9200/_analyze?analyzer=query_ansj' -d '中华人民共和国'

produces the following error:

{"error":"ElasticsearchIllegalArgumentException[failed to find analyzer [query_ansj]]","status":400}

Switching to search_ansj or index_ansj gives the same error.

Ambiguity dictionary problem

How can I prevent index_ansj from splitting up words that are in the ambiguity dictionary again? The ambiguity dictionary contains 不确定, yet the segmentation result contains both 不确定 and 确定... I have already set enable_skip_user_define: false.

What are the format rules?

For example:

publish ansj_term u:c:视康
publish ansj_term u:d:视康
publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v

u:c
u:d
...
What do these prefixes mean?
