Entity Extraction

A library that identifies and extracts domain-specific entities from unstructured text. It is built on Apache OpenNLP. See the [Javadocs](https://github.com/stucco/entity-extractor/tree/master/doc) for more details.

Input

Apache's OpenNLP perceptron models in binary format.

  • en-sent.bin - built-in model file to detect sentence boundaries; downloaded from the [OpenNLP wiki](http://opennlp.sourceforge.net/models-1.5)
  • en-pos-perceptron.bin - built-in model file to assign part of speech tags; downloaded from the [OpenNLP wiki](http://opennlp.sourceforge.net/models-1.5)
  • test-IOB-perceptron.bin - built-in model file to assign I-O-B formatting tags; created by training a perceptron model on an [annotated cyber security corpus](https://github.com/stucco/auto-labeled-corpus); used as the default model for testing
  • test-Domain-perceptron.bin - built-in model file used to label domain-specific entities; created by training a perceptron model on an [annotated cyber security corpus](https://github.com/stucco/auto-labeled-corpus); used as the default model for testing
  • user-created perceptron model file in binary format that represents an I-O-B formatting model
  • user-created perceptron model file in binary format that represents a domain-specific entities model
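For readers unfamiliar with I-O-B tags, the sketch below shows how such tags are conventionally read: `B` begins a chunk, `I` continues the open chunk, and `O` marks a token outside any chunk. `IobChunker` is a hypothetical helper written for this illustration and is not part of entity-extractor; the tags in `main` are hand-written, not output from the models listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of entity-extractor: groups tokens into
// chunks using the conventional I-O-B reading (B = begin, I = inside, O = outside).
class IobChunker {

    static List<String> chunk(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            if (tag.equals("B")) {
                // A B tag always starts a new chunk, closing any open one.
                if (current != null) chunks.add(current.toString());
                current = new StringBuilder(words[i]);
            } else if (tag.equals("I") && current != null) {
                // An I tag extends the chunk that is currently open.
                current.append(' ').append(words[i]);
            } else {
                // An O tag (or a stray I with no open chunk) closes any open chunk.
                if (current != null) chunks.add(current.toString());
                current = null;
            }
        }
        if (current != null) chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        // Hand-written tags for illustration only.
        String[] words = {"Apple", "Mac", "OS", "X", "."};
        String[] tags  = {"B", "B", "I", "I", "O"};
        System.out.println(chunk(words, tags)); // prints [Apple, Mac OS X]
    }
}
```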

Output

JSON-formatted string representing the annotated version of the unstructured text. Example JSON string:

{
  "sentences" : [ {
    "sentence" : [ {
      "word" : "Microsoft",
      "pos" : "NNP",
      "iob" : "B",
      "domainLabel" : "sw.vendor",
      "domainScore" : 0.3083600176497029
    }, {
      "word" : "Windows",
      "pos" : "NNP",
      "iob" : "B",
      "domainLabel" : "sw.product",
      "domainScore" : 0.29496901049795676
    }, {
      "word" : "XP",
      "pos" : "NNP",
      "iob" : "O",
      "domainLabel" : "O",
      "domainScore" : 0.30469849374496444
    }, {
      "word" : ".",
      "pos" : ".",
      "iob" : "O",
      "domainLabel" : "O",
      "domainScore" : 0.3035644073881222
    } ]
  }, {
    "sentence" : [ {
      "word" : "Apple",
      "pos" : "NNP",
      "iob" : "B",
      "domainLabel" : "sw.vendor",
      "domainScore" : 0.3023481844622021
    },
      ...
    ]
  },
    ...
  ]
}
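As a rough illustration of working with this structure, the sketch below models one annotated token and collects the tokens that carry a domain label. `AnnotatedWord` and `LabeledEntities` are hypothetical names invented for this example; the library's own `Sentences` class may be shaped differently. The sketch also assumes that a `domainLabel` of `O` means "no domain entity", as the sample output suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical type mirroring one token object in the JSON output above.
class AnnotatedWord {
    final String word;
    final String pos;
    final String iob;
    final String domainLabel;
    final double domainScore;

    AnnotatedWord(String word, String pos, String iob, String domainLabel, double domainScore) {
        this.word = word;
        this.pos = pos;
        this.iob = iob;
        this.domainLabel = domainLabel;
        this.domainScore = domainScore;
    }
}

// Hypothetical helper, not part of entity-extractor.
class LabeledEntities {

    // Returns "word=label" for every token whose domainLabel is not "O"
    // (assumption: "O" marks tokens without a domain entity).
    static List<String> labeled(List<AnnotatedWord> sentence) {
        List<String> result = new ArrayList<>();
        for (AnnotatedWord w : sentence) {
            if (!"O".equals(w.domainLabel)) {
                result.add(w.word + "=" + w.domainLabel);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values copied from the first sentence of the example JSON.
        List<AnnotatedWord> sentence = List.of(
            new AnnotatedWord("Microsoft", "NNP", "B", "sw.vendor", 0.3083600176497029),
            new AnnotatedWord("Windows", "NNP", "B", "sw.product", 0.29496901049795676),
            new AnnotatedWord("XP", "NNP", "O", "O", 0.30469849374496444),
            new AnnotatedWord(".", ".", "O", "O", 0.3035644073881222));
        System.out.println(labeled(sentence)); // prints [Microsoft=sw.vendor, Windows=sw.product]
    }
}
```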

Usage

// Load the I-O-B and domain-entity perceptron models, then annotate the text
EntityExtractor entityExtractor = new EntityExtractor("/path/to/IOB_perceptron_model.bin", "/path/to/Domain_perceptron_model.bin");
String json = entityExtractor.getAnnotatedTextAsJson("Microsoft Windows XP. Apple Mac OS X.");

// Transform the JSON string into a Sentences instance
// (ObjectMapper is assumed here to be Jackson's JSON mapper)
ObjectMapper mapper = new ObjectMapper();
Sentences sentences = mapper.readValue(json, Sentences.class);
