vespa-kuromoji-linguistics's Introduction

Vespa Linguistics with Kuromoji Tokenizer

Overview

This package provides a Japanese tokenizer for Vespa based on Kuromoji. Kuromoji is a well-known Japanese tokenizer, implemented in Java and used by services such as Solr and Elasticsearch. For more details, see the official Kuromoji website.

Create Package

Requirement

JDK (>= 11) and Maven are required to build the package.

Build

Run the mvn command below. The package is written to target/kuromoji-linguistics-${VERSION}-deploy.jar.

$ mvn package -Dvespa.version='7.594.36'     # You can specify 7.594.36 or later.

Use Package

Deploy

Put the built package in the components directory of your application package. If there is no components directory, create it. For example, with sampleapps the structure looks like this:

  • sampleapps/search/music/
    • services.xml
    • components/
      • kuromoji-linguistics-${VERSION}-deploy.jar

Configuration

Because the package is used by both searchers and indexers, it is recommended to define the <component> in every <container> section of services.xml.

<container id="container" version="1.0">
    <component id="kuromoji" class="jp.co.yahoo.vespa.language.lib.kuromoji.KuromojiLinguistics" bundle="kuromoji-linguistics">
        <config name="language.lib.kuromoji.kuromoji">
            <mode>search</mode>
            <ignore_case>true</ignore_case>
        </config>
    </component>
</container>

You can optionally configure the package with <config name="language.lib.kuromoji.kuromoji">. The parameters and their default values are listed below.

parameter               type    default  description
mode                    string  search   Kuromoji mode (normal, search, or extended)
kanji.length_threshold  int     2        length above which kanji tokens are penalized during the Viterbi search (expert feature)
kanji.penalty           int     3000     additional cost for kanji tokens longer than kanji.length_threshold (expert feature)
other.length_threshold  int     7        length above which non-kanji tokens are penalized during the Viterbi search (expert feature)
other.penalty           int     1700     additional cost for non-kanji tokens longer than other.length_threshold (expert feature)
nakaguro_split          bool    false    whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)
user_dict               string  -        path to a user dictionary
tokenlist_name          string  default  name of the specialtokens list to use
all_language            bool    false    whether to apply the Kuromoji tokenizer to all languages (true) or only to Japanese (false)
ignore_case             bool    true     whether to ignore differences between upper and lower case
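
The four expert parameters are easier to reason about with numbers. The following is a minimal Java sketch of a length penalty, not the code of this package: it assumes the extra cost grows linearly with the number of characters beyond the threshold, which matches my reading of upstream Kuromoji's search-mode heuristic but is not confirmed by this README. The class name, method names, and the Han-script check used to detect kanji are all hypothetical.

```java
// Illustrative sketch only: the linear formula below is an assumption,
// not something this package documents.
final class LengthPenaltySketch {
    // Default values copied from the parameter table.
    static final int KANJI_LENGTH_THRESHOLD = 2;
    static final int KANJI_PENALTY = 3000;
    static final int OTHER_LENGTH_THRESHOLD = 7;
    static final int OTHER_PENALTY = 1700;

    /** True when every code point of the token is in the Han script (a simplified kanji test). */
    static boolean isKanjiOnly(String surface) {
        return !surface.isEmpty()
                && surface.codePoints()
                          .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Extra cost for a candidate token: zero up to the threshold, then linear in the excess length. */
    static int lengthPenalty(String surface) {
        int length = surface.codePointCount(0, surface.length());
        if (isKanjiOnly(surface)) {
            return length > KANJI_LENGTH_THRESHOLD
                    ? (length - KANJI_LENGTH_THRESHOLD) * KANJI_PENALTY
                    : 0;
        }
        return length > OTHER_LENGTH_THRESHOLD
                ? (length - OTHER_LENGTH_THRESHOLD) * OTHER_PENALTY
                : 0;
    }

    public static void main(String[] args) {
        System.out.println(lengthPenalty("東京"));         // 2 kanji, at the threshold: prints 0
        System.out.println(lengthPenalty("関西国際空港")); // 6 kanji: (6 - 2) * 3000, prints 12000
        System.out.println(lengthPenalty("tokenizer"));    // 9 non-kanji: (9 - 7) * 1700, prints 3400
    }
}
```

Under this assumption, raising kanji.penalty or lowering kanji.length_threshold makes long kanji compounds more expensive, so the Viterbi search is more likely to prefer a segmentation into shorter tokens.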

Activate

Use the deploy commands to activate the package. For example, with sampleapps:

$ vespa-deploy prepare sampleapps/search/music/
$ vespa-deploy activate

Now you can use the tokenizer with the "language=ja" option.
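
As a rough illustration of passing that option, the Java sketch below builds a search URL with language=ja. The endpoint http://localhost:8080, the /search/ path, the use of the query parameter, and the class and method names are assumptions for the example; adjust them to your own deployment.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a Vespa search URL that asks for Japanese linguistics.
final class JapaneseQuerySketch {
    /** Percent-encodes the query text as UTF-8 and appends language=ja. */
    static URI buildSearchUri(String endpoint, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(endpoint + "/search/?query=" + encoded + "&language=ja");
    }

    public static void main(String[] args) {
        // Assumed local container endpoint; replace with your own host and port.
        System.out.println(buildSearchUri("http://localhost:8080", "東京"));
        // prints http://localhost:8080/search/?query=%E6%9D%B1%E4%BA%AC&language=ja
    }
}
```

The resulting URI can be passed to any HTTP client, for example java.net.http.HttpClient on JDK 11 or later.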

License

Code licensed under the Apache 2.0 license. See LICENSE for terms.

Contributor License Agreement

This project requires contributors to agree to a Contributor License Agreement (CLA).

Note that, only for contributions to the vespa-kuromoji-linguistics repository on GitHub (https://github.com/yahoojapan/vespa-kuromoji-linguistics), contributors shall be deemed to have agreed to the CLA without individual written agreements.

vespa-kuromoji-linguistics's People

Contributors

dependabot[bot], hotchpotch, mnagaya, takamabe, y-yuyano, yj-jtakagi, yjtakamabe
