
elasticsearch-sudachi's Introduction

analysis-sudachi

analysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi, the Japanese morphological analyzer.


What's new?

  • [3.2.1]
    • Fix OOM error with a huge document (#132)
      • The plugin now handles huge documents by splitting them into relatively small (1M character) chunks.
      • Analysis may be broken around the edges of chunks (open issue, see #131)
    • Add tutorial to use Sudachi synonym dictionary (#65)

Check changelog for more.

Build (if necessary)

  1. Build analysis-sudachi.
   $ ./gradlew -PengineVersion=es:8.13.4 build

Use -PengineVersion=os:2.14.0 for OpenSearch.
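For example, the full build command for OpenSearch 2.14.0 would be:

   $ ./gradlew -PengineVersion=os:2.14.0 build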

Supported Elasticsearch versions

  1. 8.0.* to 8.13.* - supported, integration tests in CI
  2. 7.17.* (latest patch version) - supported, integration tests in CI
  3. 7.11.* to 7.16.* - best effort support, not tested in CI
  4. 7.10.* - integration tests for the latest patch version
  5. 7.9.* and below - not tested in CI at all, may be broken
  6. 7.3.* and below - broken, not supported

Supported OpenSearch versions

  1. 2.6.* to 2.14.* - supported, integration tests in CI

Installation

  1. Move the current directory to $ES_HOME

  2. Install the Plugin

    a. Using the release package

    $ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip
    

    b. Using a self-built package

    $ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-8.13.4-3.1.1.zip
    

    (Specify the absolute path in URI format)

  3. Download the Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict

  4. Extract the .dic file and place it at config/sudachi/system_core.dic (you must install system_core.dic at this location if you use Elasticsearch 7.6 or later; see the example commands after this list)

  5. Execute "bin/elasticsearch"
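For example, steps 3 and 4 could look like the following, run from $ES_HOME. The archive name and extracted directory layout are only illustrative; use whichever dictionary archive you actually downloaded from SudachiDict:

    $ unzip sudachi-dictionary-<version>-core.zip
    $ mkdir -p config/sudachi
    $ cp sudachi-dictionary-<version>/system_core.dic config/sudachi/system_core.dic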

Update Sudachi

If you want to update the Sudachi version bundled with an installed plugin, do the following (see the example commands after this list):

  1. Download the latest version of Sudachi from the release page.
  2. Extract the Sudachi JAR file from the zip.
  3. Delete the Sudachi JAR file in $ES_HOME/plugins/analysis-sudachi and replace it with the JAR file you extracted in step 2.
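For example (the Sudachi JAR file names and version numbers are illustrative; check the actual file names in the plugin directory and in the release you downloaded):

    $ cd $ES_HOME/plugins/analysis-sudachi
    $ rm sudachi-0.7.3.jar
    $ cp /path/to/sudachi-0.7.4.jar .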

Analyzer

An analyzer named sudachi is provided. It is equivalent to the following custom analyzer.

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_baseform",
              "sudachi_part_of_speech",
              "sudachi_ja_stop"
            ]
          }
        }
      }
    }
  }
}

See the following sections for details of the tokenizer and each filter.

Tokenizer

The sudachi_tokenizer tokenizer tokenizes input texts using Sudachi.

  • split_mode: Select the splitting mode of Sudachi (A, B, C). (string, default: C)
    • C: Extracts named entities
      • Ex) 選挙管理委員会
    • B: Middle units between A and C
      • Ex) 選挙,管理,委員会
    • A: The shortest units equivalent to the UniDic short unit
      • Ex) 選挙,管理,委員,会
  • discard_punctuation: Whether to discard punctuation. (bool, default: true)
  • settings_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to es_config. (string, default: null)
  • resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to es_config. (string, default: null)
  • additional_settings: A configuration JSON string for Sudachi. This JSON string is merged into the default configuration. If this property is set, settings_path will be overridden.

Dictionary

By default, ES_HOME/config/sudachi/system_core.dic is used. You can specify the dictionary either in the file specified by settings_path or via additional_settings. Due to the security manager, you need to put resources (settings file, dictionaries, and others) under the Elasticsearch config directory.
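For example, a minimal Sudachi settings file placed under the config directory and referenced via settings_path might look like this (the dictionary file names are illustrative; the keys are the same ones used in the additional_settings example below):

{
  "systemDict": "system_full.dic",
  "userDict": ["user.dic"]
}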

Example

tokenizer configuration

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/config/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}

dictionary settings

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_full.dic\",\"userDict\":[\"user.dic\"]}"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}

Filters

sudachi_split

The sudachi_split token filter works like the mode option of the kuromoji tokenizer.

  • mode
    • "search": Additional segmentation useful for search. (Use C and A mode)
      • Ex)関西国際空港, 関西, 国際, 空港 / アバラカダブラ
    • "extended": Similar to search mode, but also unigram unknown words.
      • Ex)関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ

Note: In a search query, split subwords are handled as a phrase (in the same way as multi-word synonyms). If you want to search with both A and C units, use multiple tokenizers instead (a sketch follows the example below).

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_searchfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
    "analyzer": "sudachi_analyzer",
    "text": "関西国際空港"
}

Which responds with:

{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}
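For the multi-tokenizer approach mentioned in the note above, one option is to index the same field with two analyzers built on different split_mode settings (multi-fields). The following is a minimal sketch, not part of the plugin documentation; the index, field, and analyzer names are illustrative:

PUT sudachi_multi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_a_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "A"
          },
          "sudachi_c_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C"
          }
        },
        "analyzer": {
          "sudachi_a": {
            "type": "custom",
            "tokenizer": "sudachi_a_tokenizer"
          },
          "sudachi_c": {
            "type": "custom",
            "tokenizer": "sudachi_c_tokenizer"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sentence": {
        "type": "text",
        "analyzer": "sudachi_c",
        "fields": {
          "a_mode": {
            "type": "text",
            "analyzer": "sudachi_a"
          }
        }
      }
    }
  }
}

At query time you can then match against both sentence and sentence.a_mode (for example with a dis_max or multi_match query) instead of relying on sudachi_split at search time.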

sudachi_part_of_speech

The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in lucene-analysis-sudachi.jar.

Sudachi POS information is a CSV list consisting of 6 items:

  • 1-4 part-of-speech hierarchy (品詞階層)
  • 5 inflectional type (活用型)
  • 6 inflectional form (活用形)

With stoptags, you can filter out tokens using any of these forward-matching forms:

  • 1 - e.g., 名詞
  • 1,2 - e.g., 名詞,固有名詞
  • 1,2,3 - e.g., 名詞,固有名詞,地名
  • 1,2,3,4 - e.g., 名詞,固有名詞,地名,一般
  • 5 - e.g., 五段-カ行
  • 6 - e.g., 終止形-一般
  • 5,6 - e.g., 五段-カ行,終止形-一般

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_posfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "助詞",
              "助動詞",
              "補助記号,句点",
              "補助記号,読点"
            ]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}

Which responds with:

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "おいしい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

sudachi_ja_stop

The sudachi_ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, use the stop token filter instead.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_stopfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_stopfilter":{
            "type":"sudachi_ja_stop",
            "stopwords":[
              "_japanese_",
              "",
              "です"
            ]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "私は宇宙人です。"
}

Which responds with:

{
  "tokens": [
    {
      "token": "",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "宇宙",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    }
  ]
}

sudachi_baseform

The sudachi_baseform token filter replaces terms with their Sudachi dictionary form. This acts as a lemmatizer for verbs and adjectives.

This will be overridden by sudachi_split, sudachi_normalizedform or sudachi_readingform token filters.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_baseform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "飲み"
}

Which responds with:

{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

sudachi_normalizedform

The sudachi_normalizedform token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants. This filter also lemmatizes verbs and adjectives, so you don't need to use the sudachi_baseform filter together with it.

This will be overridden by sudachi_split, sudachi_baseform or sudachi_readingform token filters.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_normalizedform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "呑み"
}

Which responds with:

{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

sudachi_readingform

The sudachi_readingform token filter replaces the terms with their reading form in either katakana or romaji.

This will be overridden by sudachi_split, sudachi_baseform or sudachi_normalizedform token filters.

Accepts the following setting:

  • use_romaji
    • Whether romaji reading form should be output instead of katakana. Defaults to false.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": false
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["katakana_readingform"]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "katakana_analyzer",
  "text": "寿司"
}

Returns スシ.

{
  "analyzer": "romaji_analyzer",
  "text": "寿司"
}

Returns susi.

Synonym

There is a temporary way to use Sudachi Dictionary's synonym resource (Sudachi 同義語辞書) with Elasticsearch.

Please refer to this document for the details.

License

Copyright (c) 2017-2024 Works Applications Co., Ltd.
Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Originally under lucene, https://lucene.apache.org/

elasticsearch-sudachi's People

Contributors

bungoume, eiennohito, hiroshi-matsuda, hiroshi-matsuda-rit, iwamurayu, kazuma-t, kengotoda, kenmasumitsu, kissge, kmotohas, liu-to, mh-northlander, miyakelp, mocobeta, sorami, tigerhe7, vbkaisetsu


elasticsearch-sudachi's Issues

Analyzing with explain: true flag produces an exception

!curl -XPOST -s -H 'Content-Type:application/json;' \
"localhost:9200/_analyze?pretty" -d \
'{"text":"駅に行きたい", "tokenizer": "sudachi_tokenizer", "explain":true}' 
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "cannot write xcontent for unknown value of type class com.worksap.nlp.sudachi.MorphemeImpl"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "cannot write xcontent for unknown value of type class com.worksap.nlp.sudachi.MorphemeImpl",
    "suppressed" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : {
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }
    ]
  },
  "status" : 400
}

It seems that Elasticsearch converts all the attributes to XContent and tries to output them to the user. We will need to support converting our Attributes to Elasticsearch XContent.

Build with `elasticsearch 8.12.0`

I attempted to build using a Dockerfile for Elasticsearch 8.12.0 but failed at Task :integration:compileTestKotlin.

I would appreciate it if you could advise on a solution. Also, are there any plans to support the latest versions, such as 8.12.0?

FROM openjdk:17-slim as builder

RUN apt-get -y update --fix-missing && \
    apt-get -y install --no-install-recommends git openssl libssl-dev ca-certificates && \
    git clone https://github.com/WorksApplications/elasticsearch-sudachi.git

WORKDIR /elasticsearch-sudachi

USER root

RUN ./gradlew -PengineVersion=es:8.12.0 build --info

Error logs

e: file:///elasticsearch-sudachi/integration/src/test/ext/es-8.00-ge/com.worksap.nlp.elasticsearch.sudachi/aliases.kt:51:30 Type mismatch: inferred type is Stream<AnalysisPlugin!>! but (Mutable)List<AnalysisPlugin!>! was expected
e: file:///elasticsearch-sudachi/integration/src/test/java/com/worksap/nlp/elasticsearch/sudachi/BasicTest.kt:27:44 Unresolved reference: size
e: file:///elasticsearch-sudachi/integration/src/test/java/com/worksap/nlp/elasticsearch/sudachi/BasicTest.kt:29:25 Unresolved reference. None of the following candidates is applicable because of receiver type mismatch: 
public inline fun <T> Array<out TypeVariable(T)>.find(predicate: (TypeVariable(T)) -> Boolean): TypeVariable(T)? defined in kotlin.collections
public inline fun BooleanArray.find(predicate: (Boolean) -> Boolean): Boolean? defined in kotlin.collections
public inline fun ByteArray.find(predicate: (Byte) -> Boolean): Byte? defined in kotlin.collections
public inline fun CharArray.find(predicate: (Char) -> Boolean): Char? defined in kotlin.collections
public inline fun CharSequence.find(predicate: (Char) -> Boolean): Char? defined in kotlin.text
public inline fun DoubleArray.find(predicate: (Double) -> Boolean): Double? defined in kotlin.collections
public inline fun FloatArray.find(predicate: (Float) -> Boolean): Float? defined in kotlin.collections
public inline fun IntArray.find(predicate: (Int) -> Boolean): Int? defined in kotlin.collections
public inline fun LongArray.find(predicate: (Long) -> Boolean): Long? defined in kotlin.collections
public inline fun ShortArray.find(predicate: (Short) -> Boolean): Short? defined in kotlin.collections
public inline fun UByteArray.find(predicate: (UByte) -> Boolean): UByte? defined in kotlin.collections
public inline fun UIntArray.find(predicate: (UInt) -> Boolean): UInt? defined in kotlin.collections
public inline fun ULongArray.find(predicate: (ULong) -> Boolean): ULong? defined in kotlin.collections
public inline fun UShortArray.find(predicate: (UShort) -> Boolean): UShort? defined in kotlin.collections
public inline fun <T> Iterable<TypeVariable(T)>.find(predicate: (TypeVariable(T)) -> Boolean): TypeVariable(T)? defined in kotlin.collections
public inline fun <T> Sequence<TypeVariable(T)>.find(predicate: (TypeVariable(T)) -> Boolean): TypeVariable(T)? defined in kotlin.sequences
e: file:///elasticsearch-sudachi/integration/src/test/java/com/worksap/nlp/elasticsearch/sudachi/BasicTest.kt:30:11 Unresolved reference: it

> Task :compileTestKotlin FAILED
e: file:///elasticsearch-sudachi/src/test/java/com/worksap/nlp/elasticsearch/sudachi/index/SearchEngineEnv.kt:54:35 Type mismatch: inferred type is IndexSettings! but IndexService.IndexCreationContext! was expected
e: file:///elasticsearch-sudachi/src/test/java/com/worksap/nlp/elasticsearch/sudachi/index/SearchEngineEnv.kt:54:48 No value passed for parameter 'p1'

FAILURE: Build completed with 2 failures.

sudachi analyzer has a bug with icu_normalizer (wrong offset)

ES version: 8.5.1
elasticsearch-sudachi: 3.0.1

Index definition:

PUT sudachi_icu_test
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "icu_normalize": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "mode": "compose"
          }
        },
        "filter": {
          "sudachi_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        },
        "analyzer": {
          "sudachi_index_analyzer_with": {
            "type": "custom",
            "char_filter": "icu_normalize",
            "tokenizer": "sudachi_tokenizer",
            "filter": "sudachi_searchfilter"
          },
          "sudachi_index_analyzer_without": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": "sudachi_searchfilter"
          },
          "kuromoji_index_analyzer_with": {
            "type": "custom",
            "char_filter": "icu_normalize",
            "tokenizer": "kuromoji_tokenizer"
          },
          "kuromoji_index_analyzer_without": {
            "type": "custom",
            "tokenizer": "kuromoji_tokenizer"
          }
        }
      }
    }
  }
}
  • With the icu_normalize char filter, there is a problem: the result for "white" is "whit".
# request
GET sudachi_icu_test/_analyze
{
  "analyzer": "sudachi_index_analyzer_with",
  "text": ["white"]
}
 
# response
{
  "tokens": [
    {
      "token": "whit",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
  • Without the ICU char filter, there is no problem: the result for "white" is "white".
# request
GET sudachi_icu_test/_analyze
{
  "analyzer": "sudachi_index_analyzer_without",
  "text": ["white"]
}
 
# response
{
  "tokens": [
    {
      "token": "white",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}

Please take a look.

Support Elasticsearch 8

Builds targeting Elasticsearch 8 seem to fail.

$ ./gradlew -PelasticsearchVersion=8.2.2 build

FAILURE: Build failed with an exception.

* Where:
Build file 'xxxxxx/elasticsearch-sudachi/build.gradle' line: 41

* What went wrong:
Could not determine the dependencies of task ':distZip'.
> Could not create task ':copyDependencies'.
   > Could not resolve all files for configuration ':runtimeClasspath'.
      > Could not find org.elasticsearch.client:transport:8.2.2.
        Searched in the following locations:
          - https://repo.maven.apache.org/maven2/org/elasticsearch/client/transport/8.2.2/transport-8.2.2.pom
        If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.
        Required by:
            project :

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 5s

Transport Client has been deprecated and removed in Elasticsearch 8, and there is no org.elasticsearch.client:transport:8.x package.
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html

Is there any way to make it work with Elasticsearch 8?

Readingform filter for OOVs

I'm trying to use the readingform filter with the following settings.
The result for OOV tokens seems to be blank.
Is this the expected result, or am I missing some settings?

{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "my_ja_tokenizer":{
            "type":"sudachi_tokenizer",
            "mode":"search",
            "discard_punctuation":true,
            "resources_path":"\/etc\/elasticsearch\/sudachi"
          }
        },
        "filter":{
          "my_kana_readingform":{
            "type":"sudachi_readingform",
            "use_romaji":false
          }
        }
      }
    }
  }
}

Request

{
  "tokenizer": "my_ja_tokenizer",
  "filter": ["my_kana_readingform"], 
  "text": ["settei"]
}

Expected Response

{
  "tokens" : [
    {
      "token" : "settei",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}

Actual Response

{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}

Latest Sudachi causes java.security.AccessControlException

Elasticsearch crashed on the /_analyze API using elasticsearch-sudachi:v6.5.4-1.2.0 with the latest sudachi-1.2.0-SNAPSHOT.
Maybe the cause is this commit.
WorksApplications/Sudachi@001709b

Security permissions are required when opening a dictionary.

Pull request : #37

[2019-02-06T10:55:27,835][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [MCwJKXL] fatal error in thread [elasticsearch[MCwJKXL][analyze][T#1]], exiting
java.lang.ExceptionInInitializerError: null
        at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:56) ~[?:?]
        at com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer.<init>(SudachiTokenizer.java:89) ~[?:?]
        at com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer.<init>(SudachiTokenizer.java:79) ~[?:?]
        at com.worksap.nlp.elasticsearch.sudachi.index.SudachiTokenizerFactory.create(SudachiTokenizerFactory.java:68) ~[?:?]
        at org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(CustomAnalyzer.java:89) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:134) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:196) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:259) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:244) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:81) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:112) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessClassInPackage.sun.misc")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?]
        at java.security.AccessController.checkPermission(AccessController.java:895) ~[?:?]
        at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
        at java.lang.SecurityManager.checkPackageAccess(SecurityManager.java:1290) ~[?:?]
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:896) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
        at java.lang.Class.forName0(Native Method) ~[?:?]
        at java.lang.Class.forName(Class.java:315) ~[?:?]
        at com.worksap.nlp.sudachi.JapaneseDictionary.<clinit>(JapaneseDictionary.java:223) ~[?:?]
        ... 17 more

Romanization "ふ" becomes "ho" by sudachi_readingform.

I think there's a mistake here: the romanization of "ふ" should be "hu", not "ho".

PUT test/
{
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "sudachi_romaji_readingform": {
                        "type": "sudachi_readingform",
                        "use_romaji": true
                    }
                },
                "tokenizer": {
                    "sudachi_tokenizer": {
                        "type": "sudachi_tokenizer",
                        "resources_path": "/etc/elasticsearch/sudachi"
                    }
                },
                "analyzer": {
                    "romaji_analyzer": {
                        "tokenizer": "sudachi_tokenizer",
                        "filter": [
                            "sudachi_romaji_readingform"
                        ]
                    }
                }
            }
        }
    }
}

POST test/_analyze
{
  "text":"ふ",
  "analyzer": "romaji_analyzer"
}

Returns with :

{
  "tokens" : [
    {
      "token" : "ho",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

Target code:
elasticsearch-sudachi/src/main/java/com/worksap/nlp/lucene/sudachi/ja/util/Romanizer.java
line 480 - 518

            case 'フ':
                switch(ch2) {
                case 'ァ':
                    builder.append("fwa");
                    i++;
                    break;
                case 'ィ':
                    builder.append("fwi");
                    i++;
                    break;
                case 'ゥ':
                    builder.append("fwu");
                    i++;
                    break;
                case 'ェ':
                    builder.append("fwe");
                    i++;
                    break;
                case 'ォ':
                    builder.append("fwo");
                    i++;
                    break;
                case 'ャ':
                    builder.append("fya");
                    i++;
                    break;
                case 'ュ':
                    builder.append("fyu");
                    i++;
                    break;
                case 'ョ':
                    builder.append("fyo");
                    i++;
                    break;
                default:
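                    // bug reported in this issue: a bare "フ" falls through here
                    // and is romanized as "ho" instead of "hu"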
                    builder.append("ho");
                    break;
                }
                break;

Unexpected behavior with `sudachi_ja_stop` Preceding `sudachi_normalizedform`

When sudachi_ja_stop is placed before sudachi_normalizedform, it does not work as expected.

I added an experiment to the test code of the forked repository, and you can actually run it. The test passes, but the behavior is not as expected.

When sudachi_ja_stop is placed before sudachi_normalizedform as shown below, the stopwords do not work as intended. I expected the query "東京にふく" to be split into "東京", "に", and "ふく", and with the use of stopwords, only "東京" would remain. However, the actual result is "東京", "に", and "吹く". This happens even if "吹く" or "に" is included in stopwords; the behavior remains the same.

{
  "index.analysis": {
    "analyzer": {
      "sudachi_test": {
        "type": "custom",
        "tokenizer": "sudachi_tokenizer",
        "filter": ["my_stopfilter", "sudachi_normalizedform"]
      }
    },
    "tokenizer": {
      "sudachi_tokenizer": {
        "type": "sudachi_tokenizer",
        "split_mode": "C"
      }
    },
    "filter": {
      "my_stopfilter": {
        "type": "sudachi_ja_stop",
        "stopwords": ["", "ふく", "吹く"]
      }
    }
  }
}

Conversely, if the order of sudachi_ja_stop and sudachi_normalizedform is swapped, and the normalized string ("吹く") is included in stopwords, it works. The query "東京にふく" is converted to "東京", but it is not the expected behavior to include the normalized string in stopwords.

{
  "index.analysis": {
    "analyzer": {
      "sudachi_test": {
        "type": "custom",
        "tokenizer": "sudachi_tokenizer",
        "filter": ["sudachi_normalizedform", "my_stopfilter"]
      }
    },
    "tokenizer": {
      "sudachi_tokenizer": {
        "type": "sudachi_tokenizer",
        "split_mode": "C"
      }
    },
    "filter": {
      "my_stopfilter": {
        "type": "sudachi_ja_stop",
        "stopwords": ["", "吹く"]
      }
    }
  }
}

The specific issue is that in the phrase "確認したい", the word "し" is not dropped by stopwords as desired, because it is transformed into "為る" and cannot be excluded. As a workaround, adding "為る" to stopwords resolves the issue, but it is not a fundamental solution.

I am eager to contribute to the development. I have already started reading and trying to understand the code. If there is anything I can help with, please let me know.

Romanization "susi" return empty token by sudachi_readingform

Using analysis-sudachi-elasticsearch7.5-1.3.2.

I want to use romaji for autocomplete.
I think "すs"'s romanization want to "sus", and "susi"'s romanization to be "susi".

PUT /romaji_test
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "sudachi_romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/usr/share/elasticsearch/config/sudachi/",
            "settings_path": "/usr/share/elasticsearch/config/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_romaji_readingform"
            ]
          }
        }
      }
    }
  }
}
POST /romaji_test/_analyze
{
  "text":"すし",
  "analyzer": "romaji_analyzer"
}
POST /romaji_test/_analyze
{
  "text":"すs",
  "analyzer": "romaji_analyzer"
}
POST /romaji_test/_analyze
{
  "text":"susi",
  "analyzer": "romaji_analyzer"
}

Returns with :

{
  "tokens" : [
    {
      "token" : "susi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    }
  ]
}
{
  "tokens" : [
    {
      "token" : "su",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    }
  ]
}
{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}

In addition, kuromoji_tokenizer returns "sushi", "sus", and "susi".

Can not install the plugin

Hi,
I tried to install the Sudachi plugin on my Elasticsearch cluster on Elastic Cloud. However, Kibana failed to restart once I added the plugin.
So I also tried to install plugin v3.0.1 and v3.0.0 on my local Elasticsearch v8.5.3 on a Windows PC, but got the following error.
Is there any idea how to solve it?


D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin>elasticsearch-plugin install D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
-> Installing D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
-> Downloading D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
-> Failed installing D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
-> Rolling back D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
-> Rolled back D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin\analysis-sudachi-8.5.3-3.0.0.zip
Exception in thread "main" java.net.MalformedURLException: unknown protocol: d
at java.base/java.net.URL.(URL.java:682)
at java.base/java.net.URL.(URL.java:570)
at java.base/java.net.URL.(URL.java:517)
at org.elasticsearch.plugins.cli.InstallPluginAction.downloadZip(InstallPluginAction.java:458)
at org.elasticsearch.plugins.cli.InstallPluginAction.download(InstallPluginAction.java:329)
at org.elasticsearch.plugins.cli.InstallPluginAction.execute(InstallPluginAction.java:247)
at org.elasticsearch.plugins.cli.InstallPluginCommand.execute(InstallPluginCommand.java:89)
at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:94)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
at org.elasticsearch.cli.Command.main(Command.java:50)
at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)

D:\00Data\dev\dev\elasticsearch-8.5.3-windows-x86_64\elasticsearch-8.5.3\bin>


Best regards

Unable to get MorphemeAttribute from other plugins

We are trying to provide a filter for Sudachi analysis results in another plugin. While the unit test passes successfully, we are unable to get the MorphemeAttribute on ES. The following error message is displayed:

"type" : "illegal_state_exception",
"reason" : "Attribute MorphemeAttribute was not present"

To resolve this issue, it is necessary for the AnalysisSudachiPlugin to inherit from the ExtensiblePlugin.

words are forcibly split up by the period `.`

This is follow-up issue for the discussion:
https://sudachi-dev.slack.com/archives/C9SUQFK38/p1528178419000237

For example, I.B.M cannot be extracted as one token but is split up into I, b, m.

$ curl -XGET "http://localhost:9200/user/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "sudachi_tokenizer_core_normal", 
  "text": "I.B.M"
}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "b",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "m",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    }
  ]
}

My tokenizer settings are:

          "tokenizer": {
            "sudachi_tokenizer_core_normal": {
              "mode": "normal",
              "settings_path": "sudachi_core/sudachi.json",
              "resources_path": "sudachi_core",
              "type": "sudachi_tokenizer",
              "discard_punctuation": "false"
            },

and sudachi settings (sudachi.json) is here:

{
    "systemDict" : "system_core.dic",
    "inputTextPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.DefaultInputTextPlugin" },
        { "class" : "com.worksap.nlp.sudachi.ProlongedSoundMarkInputTextPlugin",
          "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
          "replacementSymbol": "ー"}
    ],
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.MeCabOovProviderPlugin" },
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
          "leftId" : 5968,
          "rightId" : 5968,
          "cost" : 3857 }
    ],
    "pathRewritePlugin" : [
        { "class" : "com.worksap.nlp.sudachi.JoinNumericPlugin",
          "joinKanjiNumeric" : true },
        { "class" : "com.worksap.nlp.sudachi.JoinKatakanaOovPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
          "minLength" : 3
        }
    ]
}

Surrogate pair not properly handled in SudachiSplitFilter

In SudachiSplitFilter, an OOV token will also have per-character output in extended mode.

However, the "characters" are handled as char array, which causes a problem when there are surrogate pairs.

For example, when the input text is "𝑇", there will be 3 tokens

  1. "𝑇"
  2. String.valueOf("𝑇".toCharArray()[0])
  3. String.valueOf("𝑇".toCharArray()[1])

(Possibly a similar problem exists outside this filter too?)
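A small standalone Java illustration of the underlying issue (not taken from the plugin source; for clarity only): "𝑇" is a single code point but two UTF-16 chars, so splitting by char yields two invalid half-surrogate strings, while a code-point-aware split keeps the character intact.

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "𝑇"; // U+1D447, outside the Basic Multilingual Plane
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual character)
        // Splitting by char produces two broken half-surrogate "tokens":
        System.out.println(String.valueOf(s.toCharArray()[0])); // high surrogate only
        System.out.println(String.valueOf(s.toCharArray()[1])); // low surrogate only
        // A code-point-aware split keeps the character intact:
        s.codePoints()
         .mapToObj(cp -> new String(Character.toChars(cp)))
         .forEach(System.out::println);
    }
}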

Build Tests failed due to encoding issue under Japanese Win10

Under Japanese Windows 10, when running mvn package, the build was successful but the build tests would fail due to an encoding issue.

The error message is like below:

[ERROR] testReadSentencesComma(com.worksap.nlp.lucene.sudachi.ja.TestSudachiTokenizer)  Time elapsed: 0 s  <<< ERROR!
java.lang.IllegalArgumentException: oovPOS is invalid:名▒?,普通名▒?,▒?般,*,*,*
        at com.worksap.nlp.lucene.sudachi.ja.TestSudachiTokenizer.setup(TestSudachiTokenizer.java:60)

However, the build succeeded under a Japanese Ubuntu virtual machine. I tried several things and finally found a temporary solution.

Solution:

  1. Delete old build files or re-clone this repo.
  2. Change the encoding of the file elasticsearch-sudachi\src\test\resources\com\worksap\nlp\lucene\sudachi\ja\sudachi.json to CP932.
  3. Run mvn package

Huge input causes OOM.

With the changes in "Correctly split text into sentences" (#204), SudachiTokenizer analyzes all characters (previously only the first 4096).

The change itself is fine, but because of it I see an OOM issue.
SudachiTokenizer.reset() analyzes all text and stores the result in an ArrayList<MorphemeList>. This causes OOM due to the large list size.

I think it would be better to perform the analysis gradually in the SudachiTokenizer.incrementToken() function, instead of all at once in the SudachiTokenizer.reset() function, as StandardTokenizer.java does.

POS Filter: Allow forward matching

The sudachi_part_of_speech filter excludes the words with specified POS information.

Sudachi POS information is a list consisting of 6 items. Currently, a user can specify either:

  • 1st-4th items together (excluding asterisk items)
  • 5th item (活用型)
  • 6th item (活用形)
    to filter out the result.

Currently, the user needs to specify the entire POS information.

It would be convenient if a user can just write part of the POS (say, first 1 or 2 items of the POS information list), and the filtering is done by forward matching.

Thanks to @cidrugHug8 for mentioning the topic in Elasticsearchのための新しい形態素解析器 「Sudachi」 - Qiita (in Japanese).

IllegalArgumentException with ICU Normalization char filter combination

When using the ICU normalization char filter in Elasticsearch to analyze strings exceeding 4096 characters, an IllegalArgumentException occurs with Sudachi plugin version 2.0.0 or later (including the latest version).

The exception occurs in the read method of Lucene's ICUNormalizer2CharFilter class because the offset specified in the parameter exceeds the buffer length (off >= cbuf.length). The cause of the problem seems to be that the stopping conditions for reading characters into the CharBuffer differ slightly between the tokenizeSentences method of Sudachi's JapaneseTokenizer class and the read method of the ICUNormalizer2CharFilter class.

https://sudachi-dev.slack.com/archives/CBCF278AC/p1677227368729219

How to use different split_mode settings together

@mh-northlander @kazuma-t

Thank you for developing this great plugin.

Let me ask a question related to #75.
In this comment, you showed an example of using A mode and C mode together with dis_max.

    "query": {
        "dis_max": { "queries": [
            { "match": {
                "sentence": {
                    "query": "${text}",
                    "analyzer": "sudachi_a"
                }
            }},
            { "match": {
                "sentence": {
                    "query": "${text}",
                    "analyzer": "sudachi_c"
                }
            }}
        ]}
    },

My understanding is that the behavior is "the higher score of A mode and C mode is used". Does this match the behavior from when sudachi_split could be used at search time?

As background, I am asking because I would like to reproduce the behavior from when sudachi_split could be used at search time.

Duplicate tokens for OOV when using `sudachi_split` filter's `extended` mode

I get strange output when using the sudachi_split filter with extended mode. The results are fine when using search mode.

For example, the input text bミチゴ becomes [b, b , , , ミチゴ, , , ].

Also, strangely, the analysis result changes from the 2nd analysis onwards.

Example

Elasticsearch index setting

sudachi_split filter with extended mode.

Full JSON
{
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_sudachi_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_sudachi_tokenizer",
                    "char_filter": [],
                    "filter": ["custom_sudachi_split"]
                }
            },
            "tokenizer": {
                "custom_sudachi_tokenizer": {
                    "type": "sudachi_tokenizer",
                    "resources_path": "sudachi/"
                }
            },
            "filter": {
                "custom_sudachi_split": {
                    "type": "sudachi_split",
                    "mode": "extended"
                }
            }
        }
    }
}

Analysis with the tokenizer

Case A. aミチゴ 👍

=> a / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "aミチゴ"
}

Response

{
    "tokens": [
        {
            "token": "a",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 1,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 3
        }
    ]
}

Case B. bミチゴ 👎

Analysis for the 1st time 😕

=> b / b / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "bミチゴ"
}
{
    "tokens": [
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 1,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 3
        }
    ]
}
Analysis for the 2nd time onwards 😕 😕 😕

=> b / b / / / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "bミチゴ"
}
{
    "tokens": [
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 3,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 3
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 4
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 5
        }
    ]
}

Reference: Sudachi analysis result (w/o Elasticsearch)

a is not OOV, whereas b is.

Case A. aミチゴ

$ echo "aミチゴ" | java -jar target/sudachi-0.4.3.jar -a
a       名詞,普通名詞,助数詞可能,*,*,*  a       a       アール  0
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ  ミチゴ          -1      (OOV)
EOS
Debug output
$ echo "aミチゴ" | java -jar target/sudachi-0.4.3.jar -d

=== Input dump:
aミチゴ
=== Lattice dump:
0: 10 10 (null)(0) BOS/EOS 0 0 0: 50 -739 -286 -944 211 -250 -852 50 -739 -286 -944 211 -250 -973 -852 -852 -522 -522 1908 50 -739 -286 -944 211 -250
1: 1 10 ミチゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
2: 1 10 ミチゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
3: 1 10 ミチゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
4: 1 10 ミチゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
5: 1 10 ミチゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
6: 1 10 ミチゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
7: 4 10 チゴ(219342) 名詞,普通名詞,一般,*,*,* 5142 5142 3939: 2052 -1145 5211 657 884 432 169 1010 722
8: 4 10 チゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
9: 4 10 チゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
10: 4 10 チゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
11: 4 10 チゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
12: 4 10 チゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
13: 4 10 チゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
14: 7 10 ゴ(202390) 名詞,数詞,*,*,*,* 4904 4904 13000: 2864 1110 1995 1111 748 2080 1458 1247 1940 1887 1995 1111 748 2080 1458 1247
15: 7 10 ゴ(202391) 名詞,普通名詞,一般,*,*,* 5142 5142 10761: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
16: 7 10 ゴ(202392) 名詞,普通名詞,一般,*,*,* 5142 5142 8764: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
17: 7 10 ゴ(202393) 名詞,普通名詞,一般,*,*,* 5146 5146 6234: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
18: 7 10 ゴ(202394) 名詞,普通名詞,一般,*,*,* 5146 5146 8119: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
19: 7 10 ゴ(202395) 接頭辞,*,*,*,*,* 5950 5950 8138: 1672 1340 1985 1061 1271 1256 1744 494 2650 1688 1985 1061 1271 1256 1744 494
20: 7 10 ゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1193 636 -520 66 63 -720 -106 317 -1016 2704 -520 66 63 -720 -106 317
21: 7 10 ゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1738 717 110 154 829 634 907 97 -560 1662 110 154 829 634 907 97
22: 7 10 ゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1262 995 121 437 -250 1173 417 394 -191 1583 121 437 -250 1173 417 394
23: 7 10 ゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1387 1464 -550 395 911 420 1611 -200 -512 1900 -550 395 911 420 1611 -200
24: 7 10 ゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1569 468 -737 547 251 1118 703 687 -873 1571 -737 547 251 1118 703 687
25: 7 10 ゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 794 1254 1244 745 570 560 462 -1442 1356 -880 1244 745 570 560 462 -1442
26: 1 7 ミチ(255291) 名詞,固有名詞,人名,名,*,* 4789 4789 6820: 2897 2778 2778 2897 2778 2778 987 1222 507 1957 365
27: 1 7 ミチ(255292) 名詞,普通名詞,形状詞可能,*,*,* 5159 5159 3633: 1043 1651 1651 1043 1651 1651 1151 1634 1272 1302 521
28: 1 7 ミチ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
29: 1 7 ミチ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
30: 1 7 ミチ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
31: 1 7 ミチ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
32: 1 7 ミチ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
33: 1 7 ミチ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
34: 4 7 チ(218952) 名詞,普通名詞,一般,*,*,* 5144 5144 4708: 1561 2832 3880 -1429 -285 -112 -280 -316 397
35: 4 7 チ(218953) 記号,一般,*,*,*,* 5977 5977 20000: 492 807 -2936 2391 1722 895 1000 1026 301
36: 4 7 チ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
37: 4 7 チ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
38: 4 7 チ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
39: 4 7 チ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
40: 4 7 チ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
41: 4 7 チ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
42: 1 4 ミ(254959) 名詞,数詞,*,*,*,* 4934 4934 13000: 2342 2166 2166 2342 2166 2166 1614 -16 2016 1636 1263
43: 1 4 ミ(254960) 接頭辞,*,*,*,*,* 5953 5953 6459: 2639 1717 1717 2639 1717 1717 1605 437 1120 1793 514
44: 1 4 ミ(254961) 記号,一般,*,*,*,* 5977 5977 20000: 2031 -709 -709 2031 -709 -709 2391 895 1000 1026 301
45: 1 4 ミ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
46: 1 4 ミ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
47: 1 4 ミ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
48: 1 4 ミ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
49: 1 4 ミ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
50: 1 4 ミ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
51: 0 1 A(207) 名詞,普通名詞,助数詞可能,*,*,* 5152 5152 10078: 2104
52: 0 1 A(208) 記号,文字,*,*,*,* 5978 5978 20000: 417
53: 0 1 A(209) 記号,文字,*,*,*,* 5978 5978 20000: 417
54: 0 1 a(5187) 名詞,普通名詞,助数詞可能,*,*,* 5152 5152 8579: 2104
55: 0 1 a(5188) 記号,文字,*,*,*,* 5978 5978 20000: 417
56: 0 1 a(5189) 記号,文字,*,*,*,* 5978 5978 20000: 417
57: 0 1 a(0) 名詞,普通名詞,一般,*,*,* 5139 5139 11633: 893
58: 0 1 a(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13620: 234
59: 0 1 a(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 14228: 709
60: 0 1 a(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 15793: 506
61: 0 1 a(0) 感動詞,一般,*,*,*,* 5687 5687 15246: -640
62: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 1 a(5187) 11 5152 5152 8579
1: 1 10 ミチゴ(0) 3 5139 5139 10980
=== After rewriting:
0: 0 1 a(5187) 11 5152 5152 8579
1: 1 10 ミチゴ(0) 3 5139 5139 10980
===
a       名詞,普通名詞,助数詞可能,*,*,*  a
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ
EOS

Case B. bミチゴ

$ echo "bミチゴ" | java -jar target/sudachi-0.4.3.jar -a
b       名詞,普通名詞,一般,*,*,*        b       b               -1      (OOV)
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ  ミチゴ          -1      (OOV)
EOS
Debug output
$ echo "bミチゴ" | java -jar target/sudachi-0.4.3.jar -d

=== Input dump:
bミチゴ
=== Lattice dump:
0: 10 10 (null)(0) BOS/EOS 0 0 0: 50 -739 -286 -944 211 -250 -852 50 -739 -286 -944 211 -250 -973 -852 -852 -522 -522 1908 50 -739 -286 -944 211 -250
1: 1 10 ミチゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
2: 1 10 ミチゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
3: 1 10 ミチゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
4: 1 10 ミチゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
5: 1 10 ミチゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
6: 1 10 ミチゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
7: 4 10 チゴ(219342) 名詞,普通名詞,一般,*,*,* 5142 5142 3939: 2052 -1145 5211 657 884 432 169 1010 722
8: 4 10 チゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
9: 4 10 チゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
10: 4 10 チゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
11: 4 10 チゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
12: 4 10 チゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
13: 4 10 チゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
14: 7 10 ゴ(202390) 名詞,数詞,*,*,*,* 4904 4904 13000: 2864 1110 1995 1111 748 2080 1458 1247 1940 1887 1995 1111 748 2080 1458 1247
15: 7 10 ゴ(202391) 名詞,普通名詞,一般,*,*,* 5142 5142 10761: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
16: 7 10 ゴ(202392) 名詞,普通名詞,一般,*,*,* 5142 5142 8764: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
17: 7 10 ゴ(202393) 名詞,普通名詞,一般,*,*,* 5146 5146 6234: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
18: 7 10 ゴ(202394) 名詞,普通名詞,一般,*,*,* 5146 5146 8119: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
19: 7 10 ゴ(202395) 接頭辞,*,*,*,*,* 5950 5950 8138: 1672 1340 1985 1061 1271 1256 1744 494 2650 1688 1985 1061 1271 1256 1744 494
20: 7 10 ゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1193 636 -520 66 63 -720 -106 317 -1016 2704 -520 66 63 -720 -106 317
21: 7 10 ゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1738 717 110 154 829 634 907 97 -560 1662 110 154 829 634 907 97
22: 7 10 ゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1262 995 121 437 -250 1173 417 394 -191 1583 121 437 -250 1173 417 394
23: 7 10 ゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1387 1464 -550 395 911 420 1611 -200 -512 1900 -550 395 911 420 1611 -200
24: 7 10 ゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1569 468 -737 547 251 1118 703 687 -873 1571 -737 547 251 1118 703 687
25: 7 10 ゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 794 1254 1244 745 570 560 462 -1442 1356 -880 1244 745 570 560 462 -1442
26: 1 7 ミチ(255291) 名詞,固有名詞,人名,名,*,* 4789 4789 6820: 2778 2778 2778 2778 987 1222 507 1957 365
27: 1 7 ミチ(255292) 名詞,普通名詞,形状詞可能,*,*,* 5159 5159 3633: 1651 1651 1651 1651 1151 1634 1272 1302 521
28: 1 7 ミチ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
29: 1 7 ミチ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
30: 1 7 ミチ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
31: 1 7 ミチ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
32: 1 7 ミチ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
33: 1 7 ミチ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
34: 4 7 チ(218952) 名詞,普通名詞,一般,*,*,* 5144 5144 4708: 1561 2832 3880 -1429 -285 -112 -280 -316 397
35: 4 7 チ(218953) 記号,一般,*,*,*,* 5977 5977 20000: 492 807 -2936 2391 1722 895 1000 1026 301
36: 4 7 チ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
37: 4 7 チ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
38: 4 7 チ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
39: 4 7 チ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
40: 4 7 チ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
41: 4 7 チ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
42: 1 4 ミ(254959) 名詞,数詞,*,*,*,* 4934 4934 13000: 2166 2166 2166 2166 1614 -16 2016 1636 1263
43: 1 4 ミ(254960) 接頭辞,*,*,*,*,* 5953 5953 6459: 1717 1717 1717 1717 1605 437 1120 1793 514
44: 1 4 ミ(254961) 記号,一般,*,*,*,* 5977 5977 20000: -709 -709 -709 -709 2391 895 1000 1026 301
45: 1 4 ミ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
46: 1 4 ミ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
47: 1 4 ミ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
48: 1 4 ミ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
49: 1 4 ミ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
50: 1 4 ミ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
51: 0 1 B(573) 記号,文字,*,*,*,* 5978 5978 20000: 417
52: 0 1 B(574) 記号,文字,*,*,*,* 5978 5978 20000: 417
53: 0 1 b(5244) 記号,文字,*,*,*,* 5978 5978 20000: 417
54: 0 1 b(5245) 記号,文字,*,*,*,* 5978 5978 20000: 417
55: 0 1 b(0) 名詞,普通名詞,一般,*,*,* 5139 5139 11633: 893
56: 0 1 b(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13620: 234
57: 0 1 b(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 14228: 709
58: 0 1 b(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 15793: 506
59: 0 1 b(0) 感動詞,一般,*,*,*,* 5687 5687 15246: -640
60: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 1 b(0) 3 5139 5139 11633
1: 1 4 ミ(254960) 67 5953 5953 6459
2: 4 10 チゴ(219342) 3 5142 5142 3939
=== After rewriting:
0: 0 1 b(0) 3 5139 5139 11633
1: 1 10 ミチゴ(0) 3 5139 5139 10980
===
b       名詞,普通名詞,一般,*,*,*        b
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ
EOS

offsets for string array conflict with each other

Version : es 6.2.2

When I indexed string array fields and analyzed them with sudachi, the error below occurred.

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1,endOffset=8,lastStartOffset=2 for field 'tags.analyze'"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1,endOffset=8,lastStartOffset=2 for field 'tags.analyze'"
  },
  "status": 400
}

===============================
It seems to be a bug.
Here is the result of testing with the Analyze API:

REQUEST
GET test/_analyze
{
  "text": ["オフセットがおかしい", "オフセットが重複している"],
  "field": "field.sudachi"
}

RESPONSE
{
  "tokens": [
    { "token": "オフセット", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 },
    { "token": "おかしい", "start_offset": 6, "end_offset": 10, "type": "word", "position": 2 },
    { "token": "オフセット", "start_offset": 1, "end_offset": 6, "type": "word", "position": 103 },
    { "token": "重複", "start_offset": 7, "end_offset": 9, "type": "word", "position": 105 }
  ]
}

Thanks.

Refine or remove MorphemeConsumerAttribute

/**
 * This attribute tells Sudachi-based TokenStreams not to produce anything into
 * {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute} if it is
 * not the current consumer. <br>
 * This is performance optimisation and will not change correctness if resetting
 * {@code CharTermAttribute} before writing into it.
 */
interface MorphemeConsumerAttribute ...

With MorphemeConsumerAttribute, termAtt is only filled in by the last filter holding this attribute (the current consumer); it stays empty before that point, so filters that read termAtt earlier in the chain cannot work correctly.
#122 disables MCA to avoid this.

To make this attribute work with those filters, a filter needs to set termAtt when necessary, even if it is not the current consumer.
Maybe we can add an option to manually tell filters to set termAtt.

Another option is to remove this attribute, accepting some performance deterioration.

Note: this attribute is also used to warn when a filter is a no-op.

A units are not used at search time even when sudachi_split is configured

Hello.

To search using both A units and C units, I inserted documents using a settings file with sudachi_split configured as shown below, but I do not get the search results I expect, apparently because A units are not used at search time (see the search results below).

The Elasticsearch version is 7.7.0 and the elasticsearch-sudachi version is 2.1.0.

Looking at the Analyze API result, both A-unit and C-unit tokens are emitted.

Looking at the term vectors, the document is indexed with both A units and C units, and inserting with sudachi_split turned off leaves only C units, so sudachi_split seems to work correctly at insert time.

Looking at the result of searching the sentence field for 関西国際空港 with the explain option enabled, the two clauses sentence:関西国際空港 and sentence:"関西 国際 空港" are searched.

If I search with space-delimited terms such as 関西 国際 空港, each term hits as a separate token.

The problem seems to be that, to search with A units, the query should become sentence:関西 sentence:国際 sentence:空港, but it instead becomes the phrase sentence:"関西 国際 空港".
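
One workaround that might be worth trying (a sketch only, not verified): give the match query an analyzer that emits only A units, so the query is built from separate A-unit terms instead of a single phrase. The custom_a_tokenizer / custom_a_analyzer names below are hypothetical and would have to be added to the index settings first.

# Hypothetical additions to the "analysis" settings:
#   "tokenizer": { "custom_a_tokenizer": { "type": "sudachi_tokenizer", "split_mode": "A" } }
#   "analyzer":  { "custom_a_analyzer": { "type": "custom", "tokenizer": "custom_a_tokenizer" } }

# Query with the A-unit analyzer applied at search time
curl -s -H 'Content-Type: application/json' -XGET localhost:9200/minimal-sudachi/_search?pretty -d '
{
  "query": {
    "match": {
      "sentence": {
        "query": "関西国際空港",
        "analyzer": "custom_a_analyzer"
      }
    }
  }
}'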

Settings file

{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_sudachi_tokenizer",
                    "filter": [
                        "custom_sudachi_split"
                    ]
                }
            },
            "tokenizer": {
                "custom_sudachi_tokenizer": {
                    "type": "sudachi_tokenizer",
                    "discard_punctuation": false
                }
            },
            "filter": {
                "custom_sudachi_split": {
                    "type": "sudachi_split",
                    "mode": "extended"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "sentence": {
                "type": "text",
                "index": true,
                "term_vector": "with_positions_offsets",
                "analyzer": "custom_analyzer"
            }
        }
    }
}

Inserted documents

{"index": {"_id": "1"}}
{"title":"メモ1","sentence":"関西に住んでました。"}
{"index":{"_id": "2"}}
{"title":"メモ2","sentence":"関西国際空港に行きました。"}
{"index":{"_id":"3"}}
{"title":"メモ3","sentence":"国際的な職場です。"}
{"index":{"_id": "4"}}
{"title":"メモ4","sentence":"空港に来ました。"}

Search result for 関西国際空港

# query
curl -s  -H 'Content-Type: application/json' -XGET localhost:9200/minimal-sudachi/_search?pretty  -d '{"query": {"match":{"sentence": "関西国際空港"}}}'

# result
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.1501026,
    "hits" : [
      {
        "_index" : "minimal-sudachi",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.1501026,
        "_source" : {
          "title" : "メモ2",
          "sentence" : "関西国際空港に行きました。"
        }
      }
    ]
  }
}

Analyze API result

# query
curl -X GET "localhost:9200/minimal-sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"custom_analyzer", "text" : "関西国際空港"}'

# result

{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}

Result with the sudachi_split filter removed

$  curl -H 'Content-Type: application/json' -XGET "localhost:9200/minimal-sudachi/_termvectors/2?pretty"
{
  "_index" : "minimal-sudachi",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "sentence" : {
      "field_statistics" : {
        "sum_doc_freq" : 24,
        "doc_count" : 4,
        "sum_ttf" : 24
      },
      "terms" : {
        "。" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 12,
              "end_offset" : 13
            }
          ]
        },
        "た" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 11,
              "end_offset" : 12
            }
          ]
        },
        "に" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 6,
              "end_offset" : 7
            }
          ]
        },
        "まし" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 9,
              "end_offset" : 11
            }
          ]
        },
        "行き" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 9
            }
          ]
        },
        "関西国際空港" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 6
            }
          ]
        }
      }
    }
  }
}

Query result for 関西国際空港 with the explain option enabled

$ curl -H 'Content-Type: application/json' -XGET "localhost:9200/minimal-sudachi/_search?pretty"  -d '{"explain": true, "query": {"match":{"sentence": "関西国際空港"}}}'
{
  ...
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.1501026,
    "hits" : [
      {
        "_shard" : "[minimal-sudachi][0]",
        "_node" : "6L-bPRuCT0qpwpA17zVAgw",
        "_index" : "minimal-sudachi",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.1501026,
        "_source" : {
          "title" : "メモ2",
          "sentence" : "関西国際空港に行きました。"
        },
        "_explanation" : {
          "value" : 3.1501026,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.1550897,
              "description" : "weight(sentence:関西国際空港 in 1) [PerFieldSimilarity], result of:",
              "details": [...]
            },
            {
              "value" : 1.995013,
              "description" : "weight(sentence:\"関西 国際 空港\" in 1) [PerFieldSimilarity], result of:",
              "details": [...]
            }
          ]
        }
      }
    ]
  }
}

Term vector state

$  curl -H 'Content-Type: application/json' -XGET "localhost:9200/minimal-sudachi/_termvectors/2?pretty"
{
  "_index" : "minimal-sudachi",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 2,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "sentence" : {
      "field_statistics" : {
        "sum_doc_freq" : 58,
        "doc_count" : 8,
        "sum_ttf" : 58
      },
      "terms" : {
        "。" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 7,
              "start_offset" : 12,
              "end_offset" : 13
            }
          ]
        },
        "た" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 6,
              "start_offset" : 11,
              "end_offset" : 12
            }
          ]
        },
        "に" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 6,
              "end_offset" : 7
            }
          ]
        },
        "まし" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 9,
              "end_offset" : 11
            }
          ]
        },
        "国際" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 2,
              "end_offset" : 4
            }
          ]
        },
        "空港" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 4,
              "end_offset" : 6
            }
          ]
        },
        "行き" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 7,
              "end_offset" : 9
            }
          ]
        },
        "関西" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "関西国際空港" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 6
            }
          ]
        }
      }
    }
  }
}

Search result for the space-delimited query

$ curl -H 'Content-Type: application/json' -XGET "localhost:9200/minimal-sudachi/_search?pretty"  -d '{"explain": true, "query": {"match":{"sentence": "関西 国際 空港"}}}'
{
  ...
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.9950131,
    "hits" : [
      {
        "_shard" : "[minimal-sudachi][0]",
        "_node" : "6L-bPRuCT0qpwpA17zVAgw",
        "_index" : "minimal-sudachi",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.9950131,
        "_source" : {
          "title" : "メモ2",
          "sentence" : "関西国際空港に行きました。"
        },
        "_explanation" : {
          "value" : 1.9950131,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.6650044,
              "description" : "weight(sentence:国際 in 1) [PerFieldSimilarity], result of:",
              "details": [...]
            },
            {
              "value" : 0.6650044,
              "description" : "weight(sentence:空港 in 1) [PerFieldSimilarity], result of:",
              "details": [...]
            },
            {
              "value" : 0.6650044,
              "description" : "weight(sentence:関西 in 1) [PerFieldSimilarity], result of:",
              "details": [...]
            }
          ]
        }
      },
      ...
    ]
  }
}

The synonym filter is being influenced by other filters

Hi team,

Thank you for a great plugin. I am really surprised and satisfied with the quality of Sudachi.

I've encountered some weird results and I'm unsure whether it's a bug or intended behavior.

Issue

It seems that the configuration of the synonym filter is being influenced by sudachi_part_of_speech. In a previous issue, it was suggested that the synonym filter should be applied last. However, applying it last appears to affect other filters. Is this behavior intentional?

Environment

  • elasticsearch-8.8.1-analysis-sudachi-3.1.0

Configuration (elasticsearch/index.json):

PUT /sudachi-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sudachi_search_analyzer_c": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer_c",
          "discard_punctuation": true,
          "filter": [
            "sudachi_pos_filter",
            "synonym_filter"
          ]
        }
      },
      "tokenizer": {
        "sudachi_tokenizer_c": {
          "type": "sudachi_tokenizer",
          "split_mode": "C",
          "discard_punctuation": true
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "山ほど => 山程, たくさん, 一杯"
          ]
        },
        "sudachi_pos_filter": {
          "type": "sudachi_part_of_speech",
          "stoptags": [
            "代名詞",
            "形状詞-タリ",
            "形状詞-助動詞語幹",
            "連体詞",
            "接続詞",
            "感動詞",
            "助動詞",
            "助詞",
            "補助記号",
            "空白"
          ]
        }
      }
    }
  }
}

Query and result

GET /sudachi-test/_analyze
{
  "text": "山に行った",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "tokens": [
    {
      "token": "山程",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "たくさん",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "一",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "行っ",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "杯",
      "start_offset": 2,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 2
    }
  ]
}

GET /sudachi-test/_analyze
{
  "text": "山ほど遊んだ",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "tokens": [
    {
      "token": "山ほど",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "遊ん",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}

The phrases 山に行った and 山ほど遊んだ produced unexpected results, and the synonym word 山ほど does not seem to be handled as expected.
(FYI: the Sudachi synonym dictionary contains the entry 山ほど,一杯.)

Based on the query result, it appears that the order of the filters is affecting the outcome. While it was suggested to apply the synonym filter last, doing so seems to impact other filters. Is this behavior intentional, or is there a need for correction?
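
One way to narrow down which filter first changes the token stream is to replay the chain step by step with the _analyze API (a debugging sketch that reuses the tokenizer and filter names from the settings above, not a fix):

GET /sudachi-test/_analyze
{
  "tokenizer": "sudachi_tokenizer_c",
  "text": "山に行った"
}

GET /sudachi-test/_analyze
{
  "tokenizer": "sudachi_tokenizer_c",
  "filter": ["sudachi_pos_filter"],
  "text": "山に行った"
}

Comparing the two outputs shows exactly what sudachi_pos_filter removes or rewrites before synonym_filter ever sees the tokens.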

Here is the repository I used for my experiments; you can reproduce this issue with it.


If you swap the order of synonym_filter and sudachi_pos_filter, some queries result in a null_pointer_exception.

PUT /sudachi-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sudachi_search_analyzer_c": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer_c",
          "discard_punctuation": true,
          "filter": [
            "synonym_filter",
            "sudachi_pos_filter"
            ]
        }
      },
      "tokenizer": {
        "sudachi_tokenizer_c": {
          "type": "sudachi_tokenizer",
          "split_mode": "C",
          "discard_punctuation": true
        }
      },
      "filter": {
        "synonym_filter" : {
            "type" : "synonym",
            "synonyms": [
              "山ほど => 山程, たくさん, 一杯"
            ]
        },
        "sudachi_pos_filter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "代名詞",
              "形状詞-タリ",
              "形状詞-助動詞語幹",
              "連体詞",
              "接続詞",
              "感動詞",
              "助動詞",
              "助詞",
              "補助記号",
              "空白"
            ]
          }
      }
    }
  }
}
GET /sudachi-test/_analyze
{
  "text": "山",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": """Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null"""
      }
    ],
    "type": "null_pointer_exception",
    "reason": """Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null"""
  },
  "status": 500
}


Failed to launch on elasticsearch 7.6 and 7.7 when system_core.dic doesn't exist

Description

On elasticsearch 7.6 and 7.7, elasticsearch fails to launch when elasticsearch-sudachi is installed but config/sudachi/system_core.dic doesn't exist.

Expected behavior

Elasticsearch starts successfully even if system_core.dic doesn't exist. Dictionary files are required only when creating or using an index that uses the analysis-sudachi analyzer.

Version

  • elasticsearch-sudachi 2.0.1
  • Elasticsearch 7.6.2 and 7.7.0 (on Docker)
  • Docker on Windows 2.3.0.3 (Docker Engine 19.03.8)

Steps to reproduce

Elasticsearch 7.5.2 (works)

# Launch a container
> docker run -it --rm --env discovery.type=single-node --env bootstrap.memory_lock=true docker.elastic.co/elasticsearch/elasticsearch:7.5.2 sh
# Install elasticsearch-sudachi
> elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v7.5.2-2.0.1/analysis-sudachi-elasticsearch7.5-2.0.1.zip
# Launch elasticsearch
> /usr/local/bin/docker-entrypoint.sh
...
{"type": "server", ..., "message": "loaded plugin [analysis-sudachi]" }
...
{"type": "server", ..., "message": "Active license is now [BASIC]; ...",  ...  }

Elasticsearch 7.6.2 (fails)

# Launch a container
> docker run -it --rm --env discovery.type=single-node --env bootstrap.memory_lock=true docker.elastic.co/elasticsearch/elasticsearch:7.6.2 sh
# Install elasticsearch-sudachi
> elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v7.6.2-2.0.1/analysis-sudachi-elasticsearch7.6-2.0.1.zip
# Launch elasticsearch
> /usr/local/bin/docker-entrypoint.sh
...
{"type": "server", ..., "message": "loaded plugin [analysis-sudachi]" }
...
{"type": "server", "timestamp": "...", "level": "ERROR", "component": "o.e.x.c.t.IndexTemplateRegistry", "cluster.name": "docker-cluster", "node.name": "...",
"message": "error adding index template [ilm-history] from [/ilm-history.json] for [index_lifecycle]",
"cluster.uuid": "...", "node.id": "..." ,
"stacktrace": ["java.io.UncheckedIOException: java.io.FileNotFoundException: /usr/share/elasticsearch/config/sudachi/system_core.dic (No such file or directory)",
"at com.worksap.nlp.lucene.sudachi.ja.SudachiAnalyzer.createComponents(SudachiAnalyzer.java:93) ~[?:?]",
"at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:136) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]",
"at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199) ~[lucene-core-8.5.1.jar:8.5.1 edb9fc409398f2c3446883f9f80595c884d245d0 - ivera - 2020-04-08 08:55:42]",
"at org.elasticsearch.index.analysis.AnalysisRegistry.checkVersions(AnalysisRegistry.java:649) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.index.analysis.AnalysisRegistry.produceAnalyzer(AnalysisRegistry.java:613) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:532) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:218) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:428) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:611) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:549) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.validateTemplate(MetaDataIndexTemplateService.java:404) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.access$300(MetaDataIndexTemplateService.java:73) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService$4.execute(MetaDataIndexTemplateService.java:301) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]",
"Caused by: java.io.FileNotFoundException: /usr/share/elasticsearch/config/sudachi/system_core.dic (No such file or directory)",
"at java.io.FileInputStream.open0(Native Method) ~[?:?]",
"at java.io.FileInputStream.open(FileInputStream.java:212) ~[?:?]",
"at java.io.FileInputStream.<init>(FileInputStream.java:154) ~[?:?]",
"at java.io.FileInputStream.<init>(FileInputStream.java:109) ~[?:?]",
"at com.worksap.nlp.sudachi.MMap.map(MMap.java:52) ~[?:?]",
"at com.worksap.nlp.sudachi.dictionary.BinaryDictionary.<init>(BinaryDictionary.java:33) ~[?:?]",
"at com.worksap.nlp.sudachi.dictionary.BinaryDictionary.readSystemDictionary(BinaryDictionary.java:54) ~[?:?]",
"at com.worksap.nlp.sudachi.JapaneseDictionary.readSystemDictionary(JapaneseDictionary.java:106) ~[?:?]",
"at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:56) ~[?:?]",
"at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:62) ~[?:?]",
"at com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer.<init>(SudachiTokenizer.java:75) ~[?:?]",
"at com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer.<init>(SudachiTokenizer.java:65) ~[?:?]",
"at com.worksap.nlp.lucene.sudachi.ja.SudachiAnalyzer.createComponents(SudachiAnalyzer.java:91) ~[?:?]",
"... 26 more"] }

Synonym expansion not working (Elasticsearch v8 + sudachi_split)

Summary

In an Elasticsearch v8 environment, the synonym expansion is not functioning when using sudachi_split and synonym filters together.

Steps to Reproduce

  1. Set up an Elasticsearch v8 environment
  2. Configure an index to use both sudachi_split and synonym filters
  3. Index documents into the index
  4. Execute a search query containing synonyms

Expected Behavior

The synonym filter should expand synonyms, and documents containing the synonyms should be returned as hits.

Actual Behavior

Synonym expansion does not occur, and documents containing synonyms are not returned as hits.

Related Information

  • In Elasticsearch v7, the sample configuration provided in the documentation worked for synonym expansion
  • The documentation was last updated 4 years ago (Elasticsearch v7), and the behavior may have changed in subsequent updates

Environment

  • OS:
    • macOS 13.4.1
    • arm64
  • Docker version: 26.0.0
  • Elasticsearch version: 8.8.1
  • elasticsearch-sudachi version: 3.1.0
$ sw_vers
ProductName:            macOS
ProductVersion:         13.4.1
BuildVersion:           22F82

$ uname -m 
arm64

$ hostinfo
Mach kernel version:
         Darwin Kernel Version 22.5.0: Thu Jun  8 22:22:19 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T8103
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: arm64e (ARM64E)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 8.00 gigabytes
Default processor set: 419 tasks, 3980 threads, 8 processors
Load average: 2.02, Mach factor: 6.09

$ docker -v
Docker version 26.0.0, build 2ae903e

$ curl -X GET 'http://localhost:9200/'
{
  "name" : "5edac9bc174f",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "rtQ7kzApQ-OSQQ86bnYkPg",
  "version" : {
    "number" : "8.8.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "f8edfccba429b6477927a7c1ce1bc6729521305e",
    "build_date" : "2023-06-05T21:32:25.188464208Z",
    "build_snapshot" : false,
    "lucene_version" : "9.6.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

$ elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.0/elasticsearch-8.8.1-analysis-sudachi-3.1.0.zip

Configuration

Index settings:

{
  "settings": {
    "index": {
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "search": {
            "type": "sudachi_split",
            "mode": "search"
          },
          "synonym": {
            "type": "synonym",
            "synonyms": ["関西国際空港,関空", "関西 => 近畿"]
          }
        },
        "tokenizer": {
          "sudachi_c_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_core.dic\"}",
            "discard_punctuation": "true",
            "split_mode": "C"
          }
        },
        "analyzer": {
          "sudachi_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["search"]
          },
          "sudachi_synonym_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym"]
          },
          "sudachi_synonym_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym", "search"]
          }
        }
      }
    }
  }
}

Analysis Results

  • With sudachi_split only:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_search_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0,
          "positionLength" : 3
        },
        {
          "token" : "関西",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "国際",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "空港",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "word",
          "position" : 2
        }
      ]
    }
  • With synonym filter only:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "関空",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "SYNONYM",
          "position" : 0
        }
      ]
    }
  • With both sudachi_split and synonym filter:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_search_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0,
          "positionLength" : 3
        },
        {
          "token" : "関西",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "国際",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "空港",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "word",
          "position" : 2
        }
      ]
    }

    The synonym expansion (関空) is expected but not occurring.
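
To see at which stage the expected 関空 token disappears, one debugging sketch (not a fix) is to replay the same chain with the _analyze explain option, reusing the filters defined above; the explain output lists the tokens emitted by the tokenizer and after each filter:

GET /test_sudachi/_analyze
{
  "tokenizer": "sudachi_c_tokenizer",
  "filter": ["synonym", "search"],
  "text": "関西国際空港",
  "explain": true
}

This should show whether 関空 is produced by the synonym filter and then lost at the sudachi_split stage, or never produced at all.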

Questions

  1. Is there a way to make synonym expansion work when using sudachi_split and synonym filters together in an Elasticsearch v8 environment?
  2. Are there any reported issues or documents describing a similar problem?
  3. Have any workarounds or alternative configuration methods been found for this issue?

Any help or guidance would be greatly appreciated. Thank you in advance.

Offsets conflict for string arrays that do not end with EOS_SYMBOL

Version: es 6.5.4

When I analyzed array fields containing strings that do not end with EOS_SYMBOL, the offsets were wrong.

GET my_index/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": ["りんご、ごりら", "らっぱ"]
}

{
  "tokens" : [
    {
      "token" : "りんご",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ごり",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ら",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "らっぱ",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 103
    }
  ]
}

It seems to be a bug.

Pull request : #39

Install error on Elasticsearch 6.3.0.

I got the following error while installing analysis-sudachi-elasticsearch6.3.0-1.1.0.zip.

$ wget https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v6.3.0-1.1.0/analysis-sudachi-elasticsearch6.3.0-1.1.0.zip
$ /usr/share/elasticsearch/bin/elasticsearch-plugin install file:///usr/share/elasticsearch/analysis-sudachi-elasticsearch6.3.0-1.1.0.zip
-> Downloading file:///usr/share/elasticsearch/analysis-sudachi-elasticsearch6.3.0-1.1.0.zip
[=================================================] 100%??
ERROR: This plugin was built with an older plugin structure. Contact the plugin author to remove the intermediate "elasticsearch" directory within the plugin zip.

It seems analysis-sudachi-elasticsearch6.3.0-1.1.0.zip is not compatible with ES 6.3.0.
The problem can be reproduced with the docker.elastic.co/elasticsearch/elasticsearch:6.3.0 Docker image.

Synonyms not expanded with WARN message "MorphemeFieldFilter does nothing, it is not the current consumer"

Thanks for this useful plugin. Let me report a strange behavior.
We want to use Sudachi together with a synonym filter. However, when the synonym filter is used with Sudachi, synonyms are not expanded.

versions

Elasticsearch: v8.8.1
Sudachi Plugin: elasticsearch-8.8.1-analysis-sudachi-3.1.0

With the combination of Elasticsearch 7.16.1 and analysis-sudachi-7.16.1-2.1.1, synonyms were expanded without any problems.

How to reproduce

I have published the configuration to reproduce this issue on the morpheme-error-reproduce branch of the GitHub repository below.

https://github.com/po3rin/sudachi-elasticsearch-sample/tree/morpheme-error-reproduce

# in sudachi-elasticsearch-sample directory
# Place system_core.dic in elasticsearch/sudachi directory ...

docker-compose up -d --build
curl -X PUT "localhost:9200/test"  --header "Content-Type: application/json" -d @"index.json"

check settings

cat elasticsearch/sudachi/sudachi_synonyms.txt
サルコイドーシス,サルコイド

cat index.json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sudachi_search_analyzer_c": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer_c",
          "discard_punctuation": true,
          "filter": [
            "synonym_filter",
            "sudachi_pos_filter",
            "sudachi_baseform",
            "sudachi_normalizedform"
            ]
        }
      },
      "tokenizer": {
        "sudachi_tokenizer_c": {
          "type": "sudachi_tokenizer",
          "split_mode": "C",
          "resources_path": "/app/config/sudachi/",
          "discard_punctuation": true
        }
      },
      "filter": {
        "synonym_filter" : {
            "type" : "synonym_graph",
            "synonyms_path": "/app/config/sudachi/sudachi_synonyms.txt"
        },
        "sudachi_pos_filter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "代名詞",
              "形状詞-タリ",
              "形状詞-助動詞語幹",
              "連体詞",
              "接続詞",
              "感動詞",
              "助動詞",
              "助詞",
              "補助記号",
              "空白"
            ]
          }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {
        "type": "long",
        "index": true
      },
      "body": {
        "type": "text"
      }
    }
  }
}

create index

curl -X PUT "localhost:9200/test"  --header "Content-Type: application/json" -d @"index.json"

call analyze API

GET localhost:9200/test/_analyze
{
  "analyzer" : "sudachi_search_analyzer_c",
  "text" : "サルコイド"
}

logs

{"@timestamp":"2023-09-18T15:47:13.301Z", "log.level": "WARN", "message":"MorphemeFieldFilter does nothing, it is not the current consumer", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es01][analyze][T#1]","log.logger":"com.worksap.nlp.lucene.sudachi.ja.MorphemeFieldFilter","elasticsearch.cluster.uuid":"WheyDXkCR0OOWQ8E6DYeSw","elasticsearch.node.id":"Edyp8roBQo6bXutCqhA4VQ","elasticsearch.node.name":"es01","elasticsearch.cluster.name":"docker-cluster"}

I would appreciate it if you could check this.

Automatic reloading of the user dic

There is a nice plugin that syncs config files, such as the user dictionary, among ES instances.

Is it possible to add a feature that automatically reloads the user dictionary when it is updated?

Without such functionality we need to close and reopen all indexes that use the dictionary on all ES instances, or restart all ES instances. That is a burden.
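
For reference, the manual procedure described above currently looks roughly like this (a sketch; my-sudachi-index is a placeholder for each index that uses the dictionary, repeated on every instance):

# Close and reopen an index so its analyzers are rebuilt with the updated user dictionary
curl -X POST "localhost:9200/my-sudachi-index/_close"
curl -X POST "localhost:9200/my-sudachi-index/_open"

An automatic reload would remove the need for this (or for full node restarts).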

Fix romaji format

SudachiReadingFormFilter provides romanized pronunciations of tokens. We use Kuromoji's module for this. Because the romanization format is odd (a modified Hepburn system), it is not very useful.

For example,

  • コウ → kō
  • シュウ → shū
  • ロ゜→ lo

See
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/util/ToStringUtil.java

Romanized pronunciations are sometimes used to narrow results with an incomplete query. The Kunrei system is better suited to such a task.

invalid system dictionary on es 6.5.4

Hi,

I get the following error when I try Sudachi with ES:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[iCasELc][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "invalid system dictionary"
  },
  "status": 400
}

Here's some background info:
es version: elasticsearch-6.5.4
sudachi version: v6.5.4-1.3.1-SNAPSHOT
dict version: sudachi-dictionary-20201223-core
dict path: $ES_HOME/config/sudachi_tokenizer/system_core.dic
es command:

PUT sudachi_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
GET sudachi_sample/_analyze
{
    "analyzer": "sudachi_analyzer",
    "text": "関西国際空港"
}

Two kinds of errors occur when creating an index

Hello, and thank you for this great piece of work.
Errors occurred when I tried to create an index after installing Sudachi, so I would like to report them.

Environment

Ubuntu 18.04 LTS(WSL2)
Java: OpenJDK 8
Elasticsearch: 7.7.0

Error log

[2021-08-27T05:04:43,668][WARN ][r.suppressed             ] [els-node-1] path: /test_sudachi01, params: {index=test_sudachi01}
java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessClassInPackage.sun.misc")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?]
        at java.security.AccessController.checkPermission(AccessController.java:1036) ~[?:?]
        at java.lang.SecurityManager.checkPermission(SecurityManager.java:408) ~[?:?]
        at java.lang.SecurityManager.checkPackageAccess(SecurityManager.java:1376) ~[?:?]

Workaround

Added a security policy.
Since the policy did not seem to be loaded when placed under the plugin folder, I specified it as a startup option.

/etc/elasticsearch/jvm.options.d/security.options

-Djava.security.policy=/etc/elasticsearch/jvm.options.d/els.policy

/etc/elasticsearch/jvm.options.d/els.policy
The bottom two lines have been added to Elasticsearch's default policy.

/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0 and the Server Side Public License, v 1; you may not use this file except
 * in compliance with, at your election, the Elastic License 2.0 or the Server
 * Side Public License, v 1.
 */

// Default security policy file.
// On startup, BootStrap reads environment and adds additional permissions
// for configured paths and network binding to these.

//// SecurityManager impl:
//// Must have all permissions to properly perform access checks

grant codeBase "${codebase.elasticsearch-secure-sm}" {
  permission java.security.AllPermission;
};

//// Elasticsearch core:
//// These are only allowed inside the server jar, not in plugins
grant codeBase "${codebase.elasticsearch}" {
  // needed for loading plugins which may expect the context class loader to be set
  permission java.lang.RuntimePermission "setContextClassLoader";
};

//// Very special jar permissions:
//// These are dangerous permissions that we don't want to grant to everything.

grant codeBase "${codebase.lucene-core}" {
  // needed to allow MMapDirectory's "unmap hack" (die unmap hack, die)
  // java 8 package
  permission java.lang.RuntimePermission "accessClassInPackage.sun.misc";
  // java 9 "package"
  permission java.lang.RuntimePermission "accessClassInPackage.jdk.internal.ref";
  permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
  // NOTE: also needed for RAMUsageEstimator size calculations
  permission java.lang.RuntimePermission "accessDeclaredMembers";
};

grant codeBase "${codebase.lucene-misc}" {
  // needed to allow shard shrinking to use hard-links if possible via lucenes HardlinkCopyDirectoryWrapper
  permission java.nio.file.LinkPermission "hard";
};

grant codeBase "${codebase.elasticsearch-plugin-classloader}" {
  // needed to create the classloader which allows plugins to extend other plugins
  permission java.lang.RuntimePermission "createClassLoader";
};

grant codeBase "${codebase.jna}" {
  // for registering native methods
  permission java.lang.RuntimePermission "accessDeclaredMembers";
};

//// Everything else:

grant {
  // needed by vendored Guice
  permission java.lang.RuntimePermission "accessClassInPackage.jdk.internal.vm.annotation";

  // checked by scripting engines, and before hacks and other issues in
  // third party code, to safeguard these against unprivileged code like scripts.
  permission org.elasticsearch.SpecialPermission;

  // Allow host/ip name service lookups
  permission java.net.SocketPermission "*", "resolve";

  // Allow reading and setting socket keepalive options
  permission jdk.net.NetworkPermission "getOption.TCP_KEEPIDLE";
  permission jdk.net.NetworkPermission "setOption.TCP_KEEPIDLE";
  permission jdk.net.NetworkPermission "getOption.TCP_KEEPINTERVAL";
  permission jdk.net.NetworkPermission "setOption.TCP_KEEPINTERVAL";
  permission jdk.net.NetworkPermission "getOption.TCP_KEEPCOUNT";
  permission jdk.net.NetworkPermission "setOption.TCP_KEEPCOUNT";

  // Allow read access to all system properties
  permission java.util.PropertyPermission "*", "read";

  // TODO: clean all these property writes up, and don't allow any more in. these are all bogus!

  // LuceneTestCase randomization (locale/timezone/cpus/ssd)
  // TODO: put these in doPrivileged and move these to test-framework.policy
  permission java.util.PropertyPermission "user.language", "write";
  permission java.util.PropertyPermission "user.timezone", "write";
  permission java.util.PropertyPermission "lucene.cms.override_core_count", "write";
  permission java.util.PropertyPermission "lucene.cms.override_spins", "write";
  // messiness in LuceneTestCase: do the above, or clean this up, or simply allow to fail if its denied
  permission java.util.PropertyPermission "solr.solr.home", "write";
  permission java.util.PropertyPermission "solr.data.dir", "write";
  permission java.util.PropertyPermission "solr.directoryFactory", "write";

  // set by ESTestCase to improve test reproducibility
  // TODO: set this with gradle or some other way that repros with seed?
  permission java.util.PropertyPermission "processors.override", "write";

  // TODO: these simply trigger a noisy warning if its unable to clear the properties
  // fix that in randomizedtesting
  permission java.util.PropertyPermission "junit4.childvm.count", "write";
  permission java.util.PropertyPermission "junit4.childvm.id", "write";

  // needed by Settings
  permission java.lang.RuntimePermission "getenv.*";

  // thread permission for the same thread group and ancestor groups
  // (this logic is more strict than the JDK, see SecureSM)
  permission java.lang.RuntimePermission "modifyThread";
  permission java.lang.RuntimePermission "modifyThreadGroup";

  // needed by ExceptionSerializationTests and RestTestCase for
  // some hackish things they do. otherwise only needed by groovy
  // (TODO: clean this up?)
  permission java.lang.RuntimePermission "getProtectionDomain";

  // needed by HotThreads and potentially more
  // otherwise can be provided only to test libraries
  permission java.lang.RuntimePermission "getStackTrace";

  // needed by JMX instead of getFileSystemAttributes, seems like a bug...
  permission java.lang.RuntimePermission "getFileStoreAttributes";

  // needed for jimfs and NewPathForShardsTests
  // otherwise can be provided only to test libraries
  permission java.lang.RuntimePermission "fileSystemProvider";

  // needed by jvminfo for monitoring the jvm
  permission java.lang.management.ManagementPermission "monitor";

  // needed by JDKESLoggerTests
  permission java.util.logging.LoggingPermission "control";

  // load averages on Linux
  permission java.io.FilePermission "/proc/loadavg", "read";

  // read max virtual memory areas
  permission java.io.FilePermission "/proc/sys/vm/max_map_count", "read";

  // OS release on Linux
  permission java.io.FilePermission "/etc/os-release", "read";
  permission java.io.FilePermission "/usr/lib/os-release", "read";
  permission java.io.FilePermission "/etc/system-release", "read";

  // io stats on Linux
  permission java.io.FilePermission "/proc/self/mountinfo", "read";
  permission java.io.FilePermission "/proc/diskstats", "read";

  // control group stats on Linux
  permission java.io.FilePermission "/proc/self/cgroup", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/cpu", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/cpu/-", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/cpuacct", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/cpuacct/-", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/memory", "read";
  permission java.io.FilePermission "/sys/fs/cgroup/memory/-", "read";

  // system memory on Linux systems affected by JDK bug (#66629)
  permission java.io.FilePermission "/proc/meminfo", "read";

  // 2021/08/27
  permission java.lang.RuntimePermission "*";
  permission java.lang.reflect.ReflectPermission "*";
};

Also, after the dictionary became accessible, trying to load sudachi-dictionary-20210802-core.zip resulted in an "invalid dictionary" error.

[2021-08-27T06:22:14,163][WARN ][r.suppressed             ] [els-node-1] path: /test_sudachi02, params: {index=test_sudachi02}
java.io.UncheckedIOException: java.io.IOException: invalid dictionary
Caused by: java.io.IOException: invalid dictionary
        at com.worksap.nlp.sudachi.dictionary.BinaryDictionary.<init>(BinaryDictionary.java:47) ~[?:?]
        at com.worksap.nlp.sudachi.dictionary.BinaryDictionary.readSystemDictionary(BinaryDictionary.java:54) ~[?:?]
        at com.worksap.nlp.sudachi.JapaneseDictionary.readSystemDictionary(JapaneseDictionary.java:106) ~[?:?]

With sudachi-dictionary-20190425-core.zip, the dictionary loaded normally and I was able to create the index.

(Screenshots of the binary headers of sudachi-dictionary-20210802-core and sudachi-dictionary-20190425-core)

Comparing the headers in a binary editor, the first bytes appear to differ, so the format may have changed.
Sorry for the trouble, and thank you in advance for looking into this.

Unable to Reproduce Example as Described in Documentation

Hi,

Thank you for the great plugin.
I cannot reproduce the example written in the official documentation.

Input:

{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}

Expected (as described in the document):

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

Actual (v3.1.0 release with OpenSearch 2.6.0):

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "が",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "美味しい",
            "type": "word"
        }
    ]
}

If you think of any possible causes, please leave a comment. I appreciate your assistance.
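
For what it's worth, whether が survives depends on which token filters run after the tokenizer; a sudachi_part_of_speech filter with 助詞 (particles) in its stoptags would drop it. A quick check against your installation might look like the sketch below (the stoptags list is illustrative only, not the plugin's default):

GET /_analyze
{
  "tokenizer": "sudachi_tokenizer",
  "filter": [
    { "type": "sudachi_part_of_speech", "stoptags": ["助詞"] }
  ],
  "text": "寿司がおいしいね"
}

If が disappears here but not with the analyzer you tested, the difference most likely lies in that analyzer's filter chain rather than in the tokenizer.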
