Esanpy: Elasticsearch based Analyzer for Python

Esanpy is a Python text analyzer based on Elasticsearch. Using Elasticsearch, Esanpy provides powerful and fully customizable text analysis. Because Esanpy manages an Elasticsearch instance internally, you DO NOT need to install or configure Elasticsearch yourself.

Install Esanpy

$ pip install esanpy

To install the development version, run:

$ git clone https://github.com/codelibs/esanpy.git
$ cd esanpy
$ pip install .

Requirements

  • Python 2.7 or 3.4-3.6
  • Java 8 or above
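The requirements above can be checked before the first run. The following is a minimal sketch, not part of Esanpy: check_requirements is a hypothetical helper, it only tests whether a java executable is on PATH (it does not verify the Java version), and shutil.which needs Python 3.3 or later.

```python
import shutil
import sys


def check_requirements(version_info=sys.version_info, java_path=None):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper for illustration, not an Esanpy function.
    """
    problems = []
    # The README lists Python 2.7 or 3.4-3.6.
    version = tuple(version_info[:2])
    if not (version == (2, 7) or (3, 4) <= version <= (3, 6)):
        problems.append(
            "Python %d.%d is outside the versions listed in the README" % version
        )
    # Only checks that a java executable exists, not that it is Java 8+.
    if java_path is None:
        java_path = shutil.which("java")
    if not java_path:
        problems.append("no java executable found on PATH")
    return problems


for problem in check_requirements():
    print("warning: " + problem)
```

A Python version outside the listed range is reported as a warning only; whether Esanpy works on newer interpreters is not stated here.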

Python

First, import the esanpy module.

import esanpy

Start Server

To access Elasticsearch, use the start_server function. This function downloads and configures an embedded Elasticsearch and its plugins, and then starts the Elasticsearch instance. Elasticsearch is saved in the ~/.esanpy directory. If it is already downloaded and configured, this function just starts the Elasticsearch instance.

esanpy.start_server()

Analyze Text

Esanpy provides the analyzer and custom_analyzer functions.

tokens = esanpy.analyzer("This is a pen.")
# tokens = ["this", "is", "a", "pen"]
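Because analyzer returns a plain list of tokens, the result plugs directly into standard Python tooling. As a rough sketch, the snippet below counts token frequencies across several texts. token_counts and fake_analyze are hypothetical names; fake_analyze is a crude stand-in so the snippet runs without Elasticsearch, and with a running server esanpy.analyzer could be passed in its place.

```python
from collections import Counter


def token_counts(texts, analyze):
    """Count tokens across texts; analyze maps a string to a list of tokens."""
    counts = Counter()
    for text in texts:
        counts.update(analyze(text))
    return counts


def fake_analyze(text):
    # Stand-in for esanpy.analyzer: lowercases, drops periods, splits on spaces.
    return text.lower().replace(".", "").split()


counts = token_counts(["This is a pen.", "This is a book."], fake_analyze)
print(counts["this"], counts["pen"])  # 2 1
```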

To use another analyzer, pass its name as the analyzer argument.

tokens = esanpy.analyzer("今日の天気は晴れです。", analyzer="kuromoji")

custom_analyzer takes tokenizer, token_filter and char_filter as arguments.

tokens = esanpy.custom_analyzer('this is a <b>test</b>',
                                tokenizer="keyword",
                                token_filter=["lowercase"],
                                char_filter=["html_strip"])
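In Elasticsearch, an analysis chain applies char filters to the raw text first, then the tokenizer, then the token filters. The pure-Python sketch below mimics that order for the three components used above. It is only an approximation for illustration: strip_html, keyword_tokenizer, lowercase and run_pipeline are hypothetical names, and the regex is far cruder than the real html_strip filter.

```python
import re


def strip_html(text):
    # Crude stand-in for the html_strip char filter: drops anything tag-like.
    return re.sub(r"<[^>]+>", "", text)


def keyword_tokenizer(text):
    # The keyword tokenizer emits the whole input as a single token.
    return [text]


def lowercase(tokens):
    # Stand-in for the lowercase token filter.
    return [token.lower() for token in tokens]


def run_pipeline(text, char_filters, tokenizer, token_filters):
    """Apply char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = run_pipeline("This is a <b>TEST</b>",
                      [strip_html], keyword_tokenizer, [lowercase])
print(tokens)  # ['this is a test']
```

The order matters: the tags are removed before tokenization, so the keyword tokenizer never sees the markup.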

For details on the Elasticsearch Analyze API, see Analyze.

Stop Server

To stop Elasticsearch, use stop_server().

esanpy.stop_server()
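When a script should always shut Elasticsearch down, even if analysis raises an error, the start and stop calls can be paired in a context manager. This is a minimal sketch and not part of Esanpy: running_server is a hypothetical helper, and FakeBackend is a stand-in with the same two function names so the snippet runs without Elasticsearch.

```python
from contextlib import contextmanager


@contextmanager
def running_server(backend):
    """Start the server on entry and stop it on exit, even after an error."""
    backend.start_server()
    try:
        yield backend
    finally:
        backend.stop_server()


class FakeBackend(object):
    """Stand-in that records calls instead of managing Elasticsearch."""

    def __init__(self):
        self.calls = []

    def start_server(self):
        self.calls.append("start")

    def stop_server(self):
        self.calls.append("stop")


fake = FakeBackend()
try:
    with running_server(fake):
        raise RuntimeError("analysis failed")
except RuntimeError:
    pass
print(fake.calls)  # ['start', 'stop']
```

With the real module, the same helper would be used as "with running_server(esanpy) as es:" followed by calls such as es.analyzer(...). Stopping the server on every exit gives up the faster startup of a reused instance, so this suits one-off scripts more than interactive use.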

Command

Esanpy provides the esanpy command.

$ esanpy --text "This is a pen."
this
is
a
pen

The esanpy command starts Elasticsearch if it is not already running, so the first invocation takes some time. Later invocations are fast because the Elasticsearch instance is reused.

To change the analyzer, use the --analyzer option.

$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji
今日
天気
晴れ

The --stop option stops the Elasticsearch instance when the command exits.

$ esanpy --text "This is a pen." --stop
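The command can also be driven from another Python process. The sketch below builds the argument list from the three options shown above (--text, --analyzer, --stop) and splits the output into one token per line. build_command and run_esanpy are hypothetical helper names, and decoding the output as UTF-8 is an assumption about the terminal environment.

```python
import subprocess


def build_command(text, analyzer=None, stop=False):
    """Build the argument list for the esanpy command."""
    command = ["esanpy", "--text", text]
    if analyzer:
        command += ["--analyzer", analyzer]
    if stop:
        command.append("--stop")
    return command


def run_esanpy(text, analyzer=None, stop=False):
    """Run the command and return one token per output line.

    Requires the esanpy command on PATH; assumes UTF-8 output.
    """
    output = subprocess.check_output(build_command(text, analyzer, stop))
    return output.decode("utf-8").splitlines()


print(build_command("This is a pen.", stop=True))
# ['esanpy', '--text', 'This is a pen.', '--stop']
```

Passing the arguments as a list avoids shell quoting problems with spaces or non-ASCII text.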

Advanced Use Cases

Register Analyzer

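You can register your own analyzers with create_analysis. In the example below, mapping_file and userdict_file are variables holding the paths to a character mapping file and a Kuromoji user dictionary file. To register analyzers under the my_analyzers namespace: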

esanpy.create_analysis('my_analyzers',
                       char_filter={
                           "mapping_ja_filter": {
                               "type": "mapping",
                               "mappings_path": mapping_file
                               }
                       },
                       tokenizer={
                           "kuromoji_user_dict": {
                               "type": "kuromoji_tokenizer",
                               "mode": "normal",
                               "user_dictionary": userdict_file,
                               "discard_punctuation": False
                               }
                       },
                       token_filter={
                           "ja_stopword": {
                               "type": "ja_stop",
                               "stopwords": [
                                   "行く"
                                   ]
                               }
                       },
                       analyzer={
                           "kuromoji_analyzer": {
                               "type": "custom",
                               "char_filter": ["mapping_ja_filter"],
                               "tokenizer": "kuromoji_user_dict",
                               "filter": ["ja_stopword"]
                               }
                       }
                       )
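The example above leaves mapping_file and userdict_file undefined. One way to prepare them is sketched below. The file contents are assumptions chosen to be consistent with the expected tokens shown in this section, not the author's actual files, and write_lines is a hypothetical helper. The location is also an assumption: depending on its configuration, Elasticsearch may only read such files from under its own config directory, so a temporary directory may not be accepted.

```python
import io
import os
import tempfile


def write_lines(path, lines):
    """Write lines as UTF-8 text, one rule per line, and return the path."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"\n".join(lines) + u"\n")
    return path


# Assumed location; Elasticsearch may require a path under its config directory.
work_dir = tempfile.mkdtemp()

# Mapping char filter rules use the "key => value" form.
mapping_file = write_lines(os.path.join(work_dir, "mapping.txt"),
                           [u"① => 1"])

# Kuromoji user dictionary rows: surface form, segmentation, reading,
# part-of-speech tag. This entry is illustrative only.
userdict_file = write_lines(
    os.path.join(work_dir, "userdict.txt"),
    [u"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"],
)
```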

To use kuromoji_analyzer, call analyzer with both the namespace and the analyzer name:

tokens = esanpy.analyzer('①東京スカイツリーに行く',
                         analyzer="kuromoji_analyzer",
                         namespace='my_analyzers')
# tokens = ['1', '東京スカイツリー', 'に']

To delete the namespace, use delete_analysis:

esanpy.delete_analysis('my_analyzers')

For more information, see Analysis.

Use Kuromoji Neologd

By installing the analysis-kuromoji-neologd plugin, you can use the Neologd analyzer. To install it, use the --plugin option.

$ esanpy --stop
$ esanpy --plugin org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.6.1

After installation, the kuromoji_neologd analyzer is available.

$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji_neologd
今日の天気
晴れ

Uninstall Esanpy

To remove Esanpy, first find and kill any running Esanpy processes:

$ ps aux | grep esanpy
$ kill [above PIDs]

and then remove ~/.esanpy directory:

$ rm -rf ~/.esanpy
