Code Monkey home page Code Monkey logo

jeniatagger's Introduction

jeniatagger

Java port of the GENIA Tagger (C++).

Why?

The original C++ program contains several issues. The most important problem is that the tagger uses big dictionaries that have to be reloaded each time the program is called (taking on a modern machine ~15 seconds). With this port if you run on the JVM, the dictionaries are loaded only once and kept in memory until the JVM finishes. Therefore, subsequent calls do not have any loading overhead and so run much faster. With this port it is also possible to specify exactly which analyses you want, namely, to choose from the original, base form, pos tag, shallow parsing, and named-entity recognition. Thus you can make the program run more efficiently both in time and memory.

State of the port

  • The output of the original tagger and jenia's has been thoroughly and successfully tested to be the same.
  • It is known that when unicode characters appear in an instance, some few analyses differ from the original's. In fact, the original c++ code did not handle unicode well. It used std::string which makes a unicode character of >= 2 bytes appear in a string as multiple characters and thus wrongnly increment the size of the containing string. This leads to unexpected behavior in the same original code. Therefore comparison tests are difficult to carry for unicode.
  • The built-in tokenizer has not been implemented yet. For now, you have to use your own. See the original README. Contributions are welcome.
  • Besides some few improvements and refactoring, for now the java code resembles almost exactly the original's.

Installation

First of all you need to download the jeniatagger models. Take note of the local path where you download it.

To use from the command line

Download the runnable jar.

To use from Java or a JVM language

git clone https://github.com/jmcejuela/jeniatagger.git
cd jeniatagger
mvn install -DskipTests # install it in your local maven repository
# mvn assembly:single # create a runnable jar if you wanna use your source modifications from the command line

Use

To use from the command line

Having followed the installation instructions:

java -jar <runnable_jar_path> --models <models_path> --nt <some_file>

(without a given file the program will analyze sentences directly from the command line input)

Each sentence is written in a new line. Empty lines are ignored. The words in the sentence must be already tokenized so that the sentence is a list of space-seprated tokens. For instance:

Degenerin / Epithelial Sodium Channel ( DEG / ENaC )
Drosophila TRPA channel Pyrexia : Pyx - A and Pyx - B
...

To use from Java or a JVM langauge

The API hasn't been formally defined yet. For now see JeniaTagger

The original Genia Tagger

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.