
elasticsearch-analysis-baseform's Introduction

Elasticsearch Analysis Baseform Plugin

Baseform is an analysis plugin for Elasticsearch.

With the baseform analysis, you can use a token filter for reducing word forms to their base form.

Currently, only baseforms for German and English are implemented.

Example: the German base form of zurückgezogen is zurückziehen.

Versions

Plugin     Elasticsearch   Release date
2.2.1.1    2.2.1           Jun 22, 2016
2.2.1.0    2.2.1           Apr 23, 2016
1.4.0.0    1.4.0           Feb 19, 2015
1.3.0.0    1.3.1           Jul 30, 2014

Installation

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/2.2.1.1/elasticsearch-analysis-baseform-2.2.1.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install analysis-baseform -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip

Do not forget to restart the node after installing.

Project docs

The Maven project site is available on GitHub.

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Example (German)

In the settings, set up a token filter of type "baseform" with language "de", and reference it from an analyzer:

{
 "index":{
    "analysis":{
        "filter":{
            "baseform":{
                "type" : "baseform",
                "language" : "de"
            }
        },
        "analyzer" : {
            "baseform" : {
               "tokenizer" : "standard",
               "filter" : [ "baseform" ]
            }
        }
    }
 }
}

With this analyzer, the sentence "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" is tokenized into "Die", "Die", "Jahresfeier", "Jahresfeier", "der", "der", "Rechtsanwaltskanzleien", "Rechtsanwaltskanzlei", "auf", "auf", "dem", "der", "Donaudampfschiff", "Donaudampfschiff", "hat", "haben", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "kosten".

It is recommended to add the Unique token filter to skip tokens that occur more than once.

Example (English)

Assume a token filter of type "baseform" with language "en" has been set up in the settings:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "baseform" : {
                  "type" : "baseform",
                  "language" : "en"
              }
          },
          "analyzer" : {
              "baseform" : {
                  "tokenizer" : "standard",
                  "filter" : [ "baseform", "unique" ]
              }
          }
      }
   }
}

Then, with the text:

“I have a dream that one day this nation will rise up, and live out the true meaning of its creed: ‘We hold these truths to be self-evident: that all men are created equal.’
I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood.
I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice and sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.
I have a dream today!”

this token stream will be produced:

"I","have","a","dream","that","one","day","this","nation","will","rise","up","and","live","out",
"the","true","meaning","mean","of","its","creed","We","hold","these","truths","truth","to","be",
"self","evident","all","men","man","are","created","create","equal","on","red","hills","hill",
"Georgia","sons","son","former","slaves","slave","owners","owner","able","sit","down","together",
"at","table","brotherhood","even","state","Mississippi","sweltering","swelter","with","heat",
"injustice","oppression","transformed","transform","into","an","oasis","freedom","justice","my",
"four","little","children","child","in","where","they","not","judged","judge","by","color","their",
"skin","but","content","character","today"

As an alternative, separate dictionaries for en-verbs and en-nouns are available.
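For scripted setups, the analysis settings from the examples can be generated instead of written by hand. The Python sketch below builds the same structure for a given language code and wraps it in a "settings" object, as an index creation request body would expect. `baseform_index_settings` is a hypothetical helper, and only the language values "de" and "en" shown in the examples are assumed to be valid here.

```python
import json


def baseform_index_settings(language, analyzer_name="baseform"):
    """Build the analysis settings from the examples for one language code."""
    return {
        "index": {
            "analysis": {
                "filter": {
                    "baseform": {"type": "baseform", "language": language}
                },
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        # "unique" drops the duplicates the baseform filter emits
                        "filter": ["baseform", "unique"],
                    }
                },
            }
        }
    }


# Request body for creating an index with these settings.
body = json.dumps({"settings": baseform_index_settings("en")}, indent=2)
print(body)
```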

License

Elasticsearch Baseform Analysis Plugin

Copyright (C) 2013 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Credits

The FSA for compiling the fullform/baseform table is taken from Dawid Weiss' morfologik project

https://github.com/morfologik/morfologik-stemming

The German baseform file is a modified version of Daniel Naber's morphology file

http://www.danielnaber.de/morphologie/morphy-mapping-20110717.latin1.gz

and is distributed under CC-BY-SA http://creativecommons.org/licenses/by-sa/3.0/

The English baseforms are a modified version of the english.dict file of http://languagetool.org/download/snapshots/LanguageTool-20131115-snapshot.zip which is licensed under LGPL http://www.fsf.org/licensing/licenses/lgpl.html#SEC1

elasticsearch-analysis-baseform's People

Contributors

jprante, yaroslavgaponov


elasticsearch-analysis-baseform's Issues

StackOverflowError in Dictionary.lookup

I'm using the plugin version from the elasticsearch-plugin-bundle 1.4.0.4 with ES 1.4.2 and I've configured a filter and analyzer like this:

"analysis": {
    "analyzer": {
        "german_foobar": {
            "tokenizer": "standard",
            "filter": [
                "german_foobar"
            ],
            "type": "custom"
        }
    },
    "filter": {
        "german_foobar": {
            "language": "de",
            "type": "baseform"
        }
    }
}

When I try to analyze the string "wurde zum tollen gemacht" with this analyzer, I get a StackOverflowError in Dictionary.lookup on the server:

Exception in thread "main" org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
    at org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:61)
    at com.fileee.search.impl.DefaultSearchClient.analyze(DefaultSearchClient.java:389)
    at com.fileee.search.impl.DefaultSearchClient.main(DefaultSearchClient.java:696)
Caused by: java.util.concurrent.ExecutionException: java.lang.StackOverflowError
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:288)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:261)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:72)
    ... 3 more
Caused by: java.lang.StackOverflowError
    at java.nio.charset.CharsetDecoder.replaceWith(CharsetDecoder.java:303)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:207)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:233)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:84)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:81)
    at sun.nio.cs.UTF_8.newDecoder(UTF_8.java:68)
    at java.lang.StringCoding.decode(StringCoding.java:213)
    at java.lang.String.<init>(String.java:451)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:58)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
...
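The trace shows Dictionary.lookup calling itself again and again. One plausible cause, which is an assumption and not confirmed in this report, is a chain of dictionary entries that loops back on itself. The Python sketch below reproduces that failure mode with a toy table and shows an iterative lookup with a cycle guard. `naive_lookup`, `safe_lookup`, and the table entries are hypothetical and are not taken from the plugin's source.

```python
def naive_lookup(table, word):
    """Recursive lookup without a cycle guard (illustrates the failure mode)."""
    base = table.get(word)
    if base is None or base == word:
        return word
    return naive_lookup(table, base)


def safe_lookup(table, word):
    """Follow word -> base form links iteratively, stopping on a cycle."""
    seen = {word}
    current = word
    while True:
        nxt = table.get(current)
        if nxt is None or nxt == current:
            return current  # no further reduction available
        if nxt in seen:
            return current  # cycle detected: stop instead of looping forever
        seen.add(nxt)
        current = nxt


# Made-up entries; "a" and "b" point at each other to form a cycle.
table = {"gezogen": "ziehen", "a": "b", "b": "a"}

print(safe_lookup(table, "gezogen"))
print(safe_lookup(table, "a"))

try:
    naive_lookup(table, "a")
except RecursionError:
    print("naive lookup recursed until the stack limit was hit")
```

The guarded version returns the last form reached before the cycle closes, so an analysis request would still complete even if the dictionary contained such a loop.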

BaseformTokenFilter sets incorrect offsets for inserted baseforms

At https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/baseform/BaseformTokenFilter.java#L43 we set the offsets from the saved token. However, the offsets were never set when the token was saved, so they are always 0. This is dangerous because parts of Lucene assume offsets move forwards.

I think to fix this we should just remove that one line ... because the restoreState(current) right above it will already set the correct offsets.
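To make the offset problem concrete, here is a small Python model of a token stream. It is a sketch under simplifying assumptions (whitespace tokenization, a toy base form table) and does not mirror the Lucene API. `Token`, `tokenize`, `add_baseforms`, and `offsets_move_forward` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    term: str
    start: int
    end: int


def tokenize(text):
    """Whitespace tokenizer that records character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        pos = end
    return tokens


def add_baseforms(tokens, table):
    """Insert each base form right after its token, reusing the token's offsets."""
    out = []
    for tok in tokens:
        out.append(tok)
        base = table.get(tok.term)
        if base and base != tok.term:
            out.append(replace(tok, term=base))  # same start/end as the source
    return out


def offsets_move_forward(tokens):
    """True if start offsets never go backwards in the stream."""
    return all(a.start <= b.start for a, b in zip(tokens, tokens[1:]))


table = {"hat": "haben", "gekostet": "kosten"}  # toy dictionary
stream = add_baseforms(tokenize("hat viel gekostet"), table)

# Simulate the reported bug: the inserted base form gets offsets 0/0.
broken = stream[:-1] + [Token("kosten", 0, 0)]

print(offsets_move_forward(stream))
print(offsets_move_forward(broken))
```

In the model, copying the source token's offsets keeps the stream monotonic, while resetting them to 0 makes the last token jump backwards.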

Upgraded version for 2.4

I tried installing this plugin on Elasticsearch 2.4, but the installation failed.

elasticsearch 1.2.*

When will the plugin be compatible with version 1.2.* of Elasticsearch? Or is there a way to install it manually?

Stackoverflow error with german article "einem"

Hey, good job! The baseform analyzer is really cool, doing exactly what I needed!

However, I get a StackOverflowError when indexing/analyzing a text that contains the German word/article "einem":

GET /myindex/_analyze?analyzer=german&text=mit einem test&pretty=1

throws
[2013-12-17 18:48:47,382][DEBUG][action.admin.indices.analyze] [Karl] failed to execute [org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest@41a330e4]
java.lang.StackOverflowError
at sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:324)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:561)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:158)
at java.lang.StringCoding.decode(StringCoding.java:196)
at java.lang.String.<init>(String.java:491)
at org.xbib.elasticsearch.analysis.baseform.Dictionary.lookup(Dictionary.java:64)
at org.xbib.elasticsearch.analysis.baseform.Dictionary.lookup(Dictionary.java:65)

The last line repeats over and over...

When I change the text to "das ist ein test" everything works fine!

I just found another word that causes an exception: "lange" or "lang"

"dieser test dauert kurz" works fine
"dieser test dauert lange" causes a stack overflow

Problem with highlighting and baseform

If a text field contains the value "frostbit" (English) or "wirkt" (German) and the request asks for highlighting information, the request fails with an InvalidTokenOffsetsException.

It seems the endOffset is greater than the length of the original value.

Thanks
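A quick way to reason about this report is to check every token's offsets against the length of the field value. The Python sketch below does that for two hypothetical token streams. The assumption that "frostbite" is the emitted base form, and that an end offset could be derived from the base form's length, is made up for illustration and is not confirmed by the report.

```python
def invalid_offsets(text, tokens):
    """Return the (term, start, end) triples whose offsets fall outside text."""
    return [
        (term, start, end)
        for term, start, end in tokens
        if not (0 <= start <= end <= len(text))
    ]


text = "frostbit"

# Base form reuses the offsets of the original token.
good = [("frostbit", 0, 8), ("frostbite", 0, 8)]

# Hypothetical faulty stream: end offset taken from the longer base form.
bad = [("frostbit", 0, 8), ("frostbite", 0, len("frostbite"))]

print(invalid_offsets(text, good))
print(invalid_offsets(text, bad))
```

A highlighter that slices the original value by these offsets can only work with the first stream, since the second points one character past the end of the text.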

ES 5.1 / 5.2

Hi @jprante ,

Is there any plan to upgrade the plugin for the newer ES versions? Are there any parts the community could help you with?

Case sensitive

I'm using this plugin for German text and it seems that it's case sensitive. Is that the case? If yes, what's the reason for that?

Installation fails with file not found

It seems xbib.org is down, and hence I am unable to install the plugin for Elasticsearch 1.x.

Output of installation step:

+ ./bin/plugin -install analysis-baseform -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip
-> Installing analysis-baseform...
Trying http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip...
Failed: ConnectException[Connection refused (Connection refused)]
Trying https://github.com/null/analysis-baseform/archive/master.zip...
Failed to install analysis-baseform, reason: failed to download out of all possible locations..., use --verbose to get detailed information

English adjectives are not lemmatized

For example, "quickly" is not reduced to "quick."

It looks like there are lemma files for nouns and verbs, but not for adjectives. Is there a resource for English adjective lemmatization that could be added to the plugin?

Thanks very much.
