exist-stanford-ner

Integrate the Stanford Named Entity Recognizer into eXist-db.

Demo and documentation are included in the package.

Compile and install

  1. Clone the GitHub repository: https://github.com/wolfgangmm/exist-stanford-ner
  2. Edit build.properties and set exist.dir to point to your eXist installation directory.
  3. Run "ant" in the repository directory to create a .xar package.
  4. Upload the .xar into eXist using the dashboard.

Functions

The module provides three functions:

ner:classify-string($classifier as xs:anyURI, $text as xs:string) - processes a single string of text and returns a sequence of text nodes and elements (person, location, organization)

ner:classify-node($classifier as xs:anyURI, $node as node()) as node() - returns an in-memory copy of $node with all named entities wrapped in inline elements.

ner:classify-node($classifier as xs:anyURI, $node as node(), $callback as function(xs:string, xs:string) as item()*) as node() - returns an in-memory copy of $node. Calls the callback function for every entity found and replaces it with the return value of the function.
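As a minimal sketch of the first function, the query below passes a plain string to ner:classify-string and wraps the resulting sequence of text nodes and entity elements in a paragraph. The classifier path is the one from the usage example in this README; the sample sentence is only an illustration.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

(: Classifier path as used in the usage example of this README. :)
let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.all.3class.distsim.crf.ser.gz")
(: Illustrative input; any plain string can be passed. :)
let $text := "Timothy R. Geithner is the president of the New York Fed."
return
    (: classify-string returns text nodes mixed with entity elements,
       so the result is wrapped in a single container element here. :)
    <p>{ ner:classify-string($classifier, $text) }</p>
```

Wrapping the result in one element is a design choice for this sketch: it turns the returned sequence into a single node that is easier to store or serialize.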

For Chinese text use the variants: ner:classify-string-cn and ner:classify-node-cn.

Extended documentation can be found after installing the package.
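The callback variant of ner:classify-node might be used as sketched below to replace each entity with an HTML span. The README does not state which of the two xs:string arguments carries the entity type and which carries the matched text; the order used here (type first, text second) is an assumption and should be checked against the documentation shipped with the package.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

(: Classifier path as used in the usage example of this README. :)
let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.all.3class.distsim.crf.ser.gz")
(: Illustrative input node. :)
let $text := <p>Timothy R. Geithner is the president of the New York Fed.</p>
return
    ner:classify-node($classifier, $text,
        (: ASSUMPTION: first argument = entity type, second = matched text. :)
        function($type as xs:string, $content as xs:string) as item()* {
            <span class="{$type}">{$content}</span>
        }
    )
```

If the spans come out with the text in the class attribute, the argument order is the reverse of what this sketch assumes and the two parameters should be swapped.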

Usage example

xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.all.3class.distsim.crf.ser.gz")
let $text := <p>The fate of Lehman Brothers, the beleaguered investment bank,
   hung in the balance on Sunday as Federal Reserve officials and the leaders
   of major financial institutions continued to gather in emergency meetings
   trying to complete a plan to rescue the stricken bank.  Several possible
   plans emerged from the talks, held at the Federal Reserve Bank of New York
   and led by Timothy R. Geithner, the president of the New York Fed, and
   Treasury Secretary Henry M. Paulson Jr.</p>
return
 ner:classify-node($classifier, $text)

Support for Chinese

To recognize entities in Chinese texts, you need to obtain the Chinese classifier and segmenter. Before you build the .xar to install, download the classifier and word segmenter using the following links:

From the first package, copy chinese.misc.distsim.crf.ser.gz into resources/classifiers. From the second zip, copy

  • data/dict-chris6.ser.gz
  • data/norm.simp.utf8
  • data/ctb.gz

and everything inside data/dict into the resources/classifiers directory.
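Once the Chinese resources are in place and the package is installed, a query along the following lines could be used. This is a sketch under two assumptions that the README does not confirm: that ner:classify-string-cn takes the same arguments as ner:classify-string, and that the classifier ends up under /db/apps/stanford-ner/resources/classifiers after installation, like the English one.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

(: ASSUMPTION: installed location mirrors the English classifier path. :)
let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/chinese.misc.distsim.crf.ser.gz")
(: Illustrative input: "Obama visited Beijing." :)
let $text := "奥巴马访问了北京。"
return
    (: ASSUMPTION: same parameter list as ner:classify-string. :)
    <p>{ ner:classify-string-cn($classifier, $text) }</p>
```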
