
Elasticsearch Amazon Comprehend NLP Ingest Processor

Elasticsearch ingest processors that use Amazon Comprehend for various NLP analyses. All Comprehend detection features are supported via separate processors. Topic Modeling is not supported, although it would be an interesting project to hook up Elasticsearch as a data source for AWS Comprehend topic modeling.

Each field that is sent through the ingest process results in an AWS Comprehend API call, so this system is not meant for clusters with large workloads. There is no support for batch processing. For better performance, your Elasticsearch ingest nodes should not only be hosted in AWS, but should also be in the same region used for the AWS Comprehend API calls (configurable).

| Price per unit | Volume |
| --- | --- |
| $0.0001 | Up to 10M units |
| $0.00005 | From 10M to 50M units |
| $0.000025 | Over 50M units |

NLP requests are measured in units of 100 characters, with a 3-unit (300-character) minimum charge per request.
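
The billing arithmetic above can be sketched in a few lines of Python (the helper names are illustrative, not part of the plugin):

```python
import math

UNIT_CHARS = 100  # one billable unit = 100 characters
MIN_UNITS = 3     # 3-unit (300-character) minimum charge per request

def billable_units(text: str) -> int:
    """Billable Comprehend units for a single request."""
    return max(MIN_UNITS, math.ceil(len(text) / UNIT_CHARS))

def request_cost(text: str, price_per_unit: float = 0.0001) -> float:
    """Cost of a single request at the first pricing tier ($0.0001/unit)."""
    return billable_units(text) * price_per_unit

# A short field still incurs the 3-unit minimum:
print(billable_units("It is raining today in Seattle."))  # 3
```

Note that every analyzed field is billed independently, so a pipeline with several processors on the same field multiplies this cost per document.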

Supported Features

  • Keyphrase Extraction
  • Sentiment Analysis
  • Entity Recognition
  • Language Detection

Building

There is no downloadable version of the plugin for two reasons:

  1. It is difficult to release a plugin for each minor version of Elasticsearch, since a plugin can only run on the exact version of Elasticsearch it was built for.
  2. Given the warning at the very top regarding cost and performance, it is preferred that users build the plugin themselves rather than install it blindly, so that they are aware of the implications.

Only Elasticsearch 5.6+ is supported in order to take advantage of the secure keystore.

Installation

Only basic credentials are supported. The AWS access and secret keys are added to the Elasticsearch keystore before the node is started.
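
For example, assuming a standard Elasticsearch layout, the keys can be added with the bundled elasticsearch-keystore tool (the exact path to the tool depends on your installation); each `add` prompts for the value:

```
bin/elasticsearch-keystore create
bin/elasticsearch-keystore add ingest.aws-comprehend.credentials.access_key
bin/elasticsearch-keystore add ingest.aws-comprehend.credentials.secret_key
```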

Plugin Settings

| Setting | Description |
| --- | --- |
| ingest.aws-comprehend.credentials.access_key | AWS access key |
| ingest.aws-comprehend.credentials.secret_key | AWS secret key |
| ingest.aws-comprehend.region | AWS region used for the API call. Defaults to us-east-1 |

AWS credentials are not configured in elasticsearch.yml or in the plugin settings, but in the keystore. The settings must be in place before Elasticsearch is started.

Processor settings

| Name | Required | Default | Description |
| --- | --- | --- | --- |
| field | yes | - | The field to analyze |
| target_field | no | The name of the source field with a processor-specific suffix appended | The field to assign the result to |
| language_code | no | en | The language of the analyzed text. Not used by the Language Detection processor |
| min_score | no | 0 (all returned) | The minimum score threshold for values to be returned |
| max_values | no | 0 (all returned) | The number of values to return. If max_values is 1, a single value is returned instead of an array. Not used by the Sentiment Analysis processor |
| ignore_missing | no | false | If true and the field does not exist or is null, the processor quietly exits without modifying the document |
| types | no | empty (all returned) | Filter the returned values by entity type |

| Feature | Processor Name | Default suffix |
| --- | --- | --- |
| Keyphrase Extraction | detect-key-phrases | _keyphrases |
| Sentiment Analysis | detect-sentiment | _sentiment |
| Entity Recognition | detect-entities | _entities |
| Language Detection | detect-dominant-language | _language |
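
For instance, the types setting can restrict entity recognition to specific entity types. This sketch assumes types takes an array of Comprehend entity-type strings such as LOCATION:

```
PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline that keeps only location entities",
   "processors": [
      {
         "detect-entities": {
            "field": "my_field",
            "types": ["LOCATION"]
         }
      }
   ]
}
```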

Examples

After each pipeline is configured, the same document is indexed:

PUT /my-index/my-type/1?pipeline=aws-comprehend-pipeline
{
  "my_field" : "It is raining today in Seattle. Good thing I live in California."
}

Language detection

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-dominant-language": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_language": [
         "en"
      ]
   }
}

Entity detection

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-entities": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field_entities": [
         {
            "text": "today",
            "type": "DATE"
         },
         {
            "text": "Seattle",
            "type": "LOCATION"
         },
         {
            "text": "California",
            "type": "LOCATION"
         }
      ],
      "my_field": "It is raining today in Seattle. Good thing I live in California."
   }
}

Keyphrase Extraction

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-key-phrases": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_keyphrases": [
         "today",
         "Seattle",
         "Good thing",
         "California"
      ]
   }
}

Sentiment Analysis

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-sentiment": {
            "field": "my_field"
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field": "It is raining today in Seattle. Good thing I live in California.",
      "my_field_sentiment": "POSITIVE"
   }
}

Set a different target_field to run multiple processor instances on the same field:

PUT _ingest/pipeline/aws-comprehend-pipeline
{
   "description": "A pipeline to test AWS Comprehend",
   "processors": [
      {
         "detect-key-phrases": {
            "field": "my_field"
         }
      },
      {
         "detect-key-phrases": {
            "field": "my_field",
            "target_field": "my_field_strict",
            "min_score": 0.9
         }
      },
      {
         "detect-key-phrases": {
            "field": "my_field",
            "target_field": "my_field_trimmed",
            "max_values": 1
         }
      }
   ]
}

Result

{
   "_index": "my-index",
   "_type": "my-type",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "my_field_keyphrases": [
         "today",
         "Seattle",
         "Good thing",
         "California"
      ],
      "my_field_trimmed": "today",
      "my_field_strict": [
         "today",
         "Seattle",
         "California"
      ],
      "my_field": "It is raining today in Seattle. Good thing I live in California."
   }
}
