JRoc

We ain't afraid of Randy with his candy. You Know What Am I Sayin'?

A REST API for tagging documents and extracting entities from them. It can extract entity information in several languages, such as Norwegian, English and Spanish.

INSTALLATION

 # Clone the repo
 git clone git@github.com:domenicosolazzo/jroc.git

Instance folder

Add an instance folder with a config.py, if you want to override some of the configuration values in your local installation.

Environment variables

  • DEBUG[True|False]: Enable / Disable debugging for the Flask app (Default: False)
  • SECRET_KEY: This is a secret key that is used by Flask to sign cookies. It should be a random value
  • BASIC_AUTH_USERNAME: Username for the basic auth
  • BASIC_AUTH_PASSWORD: Password for the basic auth
  • OBT_TYPE: Type of Oslo-Bergen tagger. Check below for the possible values.
### Options for OBT_TYPE
##### tag-bm.sh
CG and statistical disambiguation, bokmål

##### tag-nostat-bm.sh
CG disambiguation only, bokmål

##### tag-nostat-nn.sh
CG disambiguation only, nynorsk

P.S.

To activate basic auth, you need to set both BASIC_AUTH_USERNAME and BASIC_AUTH_PASSWORD.
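
As a rough illustration of how the variables above could be consumed, the sketch below collects them from a mapping such as os.environ. The helper name load_settings, the returned dictionary layout and the BASIC_AUTH_ENABLED key are assumptions made for this example; JRoc's own configuration code may look different.

```python
import os


def load_settings(environ=os.environ):
    """Collect the documented environment variables into a dict.

    Hypothetical helper for illustration only.
    """
    username = environ.get("BASIC_AUTH_USERNAME")
    password = environ.get("BASIC_AUTH_PASSWORD")
    return {
        # DEBUG defaults to False, as documented above.
        "DEBUG": environ.get("DEBUG", "False").strip().lower() == "true",
        "SECRET_KEY": environ.get("SECRET_KEY"),
        "BASIC_AUTH_USERNAME": username,
        "BASIC_AUTH_PASSWORD": password,
        # Basic auth is only active when both values are set.
        "BASIC_AUTH_ENABLED": bool(username and password),
        # One of tag-bm.sh, tag-nostat-bm.sh or tag-nostat-nn.sh.
        "OBT_TYPE": environ.get("OBT_TYPE"),
    }


if __name__ == "__main__":
    settings = load_settings()
    print("debug:", settings["DEBUG"])
    print("basic auth enabled:", settings["BASIC_AUTH_ENABLED"])
```

Passing the mapping as an argument keeps the helper easy to exercise with a plain dictionary instead of the real process environment.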

Deployment

Heroku Deployment

  • Install Docker

  • (Mac Only)

    # Run this command to make Docker work in your terminal
    eval "$(docker-machine env default)"
    
  • Install the Heroku plugin for Docker (Only the first time)

heroku plugins:install heroku-container-tools
  • Create your heroku app
heroku create <heroku_app_name>
Build the Docker image
  • Use Docker Compose
# Build a new image without using the cache
docker-compose build --no-cache

# Or you can use the cached image
docker-compose build

Building a Docker image can sometimes use up all the space on your hard drive. If that happens, run these commands before building the image again. Note that they delete the default Docker machine, including its images, and create a fresh one:

docker-machine rm default
docker-machine create --driver virtualbox default
eval "$(docker-machine env default)"
Local deployment
  • Create a symlink named Dockerfile that points to Dockerfile.local
ln -s Dockerfile.local Dockerfile
  • Run the web instance with Docker Compose
docker-compose up web
  • Check that it is running in your browser
$ open "http://$(docker-machine ip default):8080"
Remote deployment
  • Build the image with Docker Compose
# Build a new image without using the cache
docker-compose build --no-cache

# Or you can use the cached image
docker-compose build
  • Run container:release
heroku container:release

Testing

  • Create your virtualenv
virtualenv <env> && source <env>/bin/activate
  • Additional repositories: clone these repos into the main folder of the project

    • The-Oslo-Bergen-Tagger
    • OBT-Stat: clone this one inside The-Oslo-Bergen-Tagger.
  • MTag

    • Copy the correct mtag file for your OS into The-Oslo-Bergen-Tagger/bin
    cp jroc/bin/<mtag-for-your-OS> The-Oslo-Bergen-Tagger/bin/mtag
    chmod +x The-Oslo-Bergen-Tagger/bin/mtag
    
  • Environment variables

export LANG="en_US.UTF-8"
  • Run the tests
nosetests --with-watch --with-isolation

# With coverage
nosetests --with-watch --with-isolation --with-coverage

N.B. If you get an IOError while running the tests, try creating a tmp folder in the root of the project.

Error Example
IOError: [Errno 2] No such file or directory: <folder here>

Usage

How to use the analyze endpoint

  curl -H "Content-Type: application/json" -X POST -d '{"data":"text_here"}' http://<your-app-domain>/tagger/analyze

How to use the entities endpoint

  curl -H "Content-Type: application/json" -X POST -d '{"data":"text_here"}' http://<your-app-domain>/tagger/entities

How to use the tags endpoint

  curl -H "Content-Type: application/json" -X POST -d '{"data":"text_here"}' http://<your-app-domain>/tagger/tags

How to use the entity extraction endpoint

    curl -H "Content-Type: application/json" -X GET  http://<your-app-domain>/entities/<entity_name>

How to extract all the types connected to a given entity

  curl -H "Content-Type: application/json" -X GET http://<your-app-domain>/entities/<entity_name>/types

How to extract all the property URIs of a given entity

  curl -H "Content-Type: application/json" -X GET http://<your-app-domain>/entities/<entity_name>/properties

How to extract the property value of a given entity

  # Quote the URL so the shell does not interpret "?"
  curl -H "Content-Type: application/json" -X GET "http://<your-app-domain>/entities/<entity_name>/properties?name=<property_uri>"

How to extract the property value of a given entity in a given language

  # Quote the URL so the shell does not treat "&" as a background operator
  curl -H "Content-Type: application/json" -X GET "http://<your-app-domain>/entities/<entity_name>/properties?name=<property_uri>&lang=<country_code>"
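
The same POST calls can be made from Python. The following is a minimal sketch using only the standard library; the helper names build_tagger_request and send, and the localhost URL, are placeholders invented for this example. Basic auth is not handled here.

```python
import json
import urllib.request


def build_tagger_request(base_url, endpoint, text):
    """Build a POST request for /tagger/analyze, /tagger/entities or /tagger/tags."""
    if endpoint not in ("analyze", "entities", "tags"):
        raise ValueError("unknown tagger endpoint: %s" % endpoint)
    # The API expects a JSON body of the form {"data": "<text>"}.
    body = json.dumps({"data": text}).encode("utf-8")
    return urllib.request.Request(
        url="%s/tagger/%s" % (base_url.rstrip("/"), endpoint),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (needs a running instance)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "http://localhost:8080" is a placeholder; substitute your own app domain.
    req = build_tagger_request(
        "http://localhost:8080", "entities", "Ivar Aasen ble født på Sunnmøre."
    )
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect the URL and body without a deployed instance.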

Endpoints

Description of the API endpoints

Endpoint: /tagger/entities

Method: POST

Description: It will return all the entities for a given text

Example

  {"data": [
    "Skriftsprog",
    "Sivert",
    "Aasen",
    ...
    "USA",
    "Ivar Aasen"],
    "uri": "http://<your-app-domain>/tagger/entities"
  }

Querystring

  • advanced[0 | 1]: If set to 1, the response includes the uri of each entity

Example

   {"data": [
    { name:"Skriftsprog", uri: "http://<your-app-domain>/entities/Skriftsprog" },
    { name:"Aasen", uri: "http://<your-app-domain>/entities/Aasen"},
    ...
    { name:"USA", uri: "http://<your-app-domain>/entities/USA"},
    { name:"Ivar Aasen", uri: "http://<your-app-domain>/entities/Ivar_Aasen"}],
    "uri": "http://<your-app-domain>/tagger/entities"
  }
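
Because the shape of data depends on the advanced flag, client code may want to normalize both forms. The sketch below assumes the response has already been parsed into a Python dictionary shaped like the two examples above; entity_names is a hypothetical helper name.

```python
def entity_names(payload):
    """Return entity names from a parsed /tagger/entities response.

    Handles the plain form (a list of strings) and the advanced=1 form
    (a list of objects with "name" and "uri" keys).
    """
    names = []
    for item in payload.get("data", []):
        if isinstance(item, dict):
            names.append(item["name"])
        else:
            names.append(item)
    return names


if __name__ == "__main__":
    plain = {"data": ["USA", "Ivar Aasen"]}
    advanced = {"data": [{"name": "USA", "uri": "http://<your-app-domain>/entities/USA"}]}
    print(entity_names(plain))
    print(entity_names(advanced))
```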

Endpoint: /tagger/tags

Method: POST

Description: It will return all the tags for a given text

Example

  data :[
      "Andreas" ,
      "USA" ,
      "Thoresen" ,
      "Denmark" ,
      "Daae" ,
      "Skodjestrømmen" ,
      ...
      "Aasen" ,
      "Sweden"
  ]

Endpoint: /tagger/analyze

Method: POST

Description: It will return all the data from the OBT tagger, plus the entities and tags, for a given text

Example

  entities: [
      "USA" ,
      "Thoresen" ,
      "Ivar Aasen" ,
      "Herøy" ,
      "Iver Andreas" ,
      "Thoresen" ,
      "Ivar Jonsson" ,
      "Hans Conrad Thoresen" ,
      "Rasmus Aarflots" ,
      "Ludvig Daae" ,
      "Norway" ,
      "Aasen" ,
      "Sweden" ,
      "Stephen Walton"
  ],
  obt: [
      {
           word: "Ivar",
           is_verb: false,
           is_number: {
                ordinal: false,
                is_number: false,
                roman: false,
                quantity: false
           },
           tagging: [
                "Ivar",
                "subst",
                "prop",
                "mask"
           ],
           options: "Ivar subst prop mask",
           is_subst: true,
           is_prop: true
       },
       {
            word: "Aasen",
            is_verb: false,
            is_number: {
                ordinal: false,
                is_number: false,
                roman: false,
                quantity: false
            },
            tagging: [
                "Aasen",
                "subst",
                "prop",
                "<*sen>",
                "<*>"
            ],
            options: "Aasen subst prop <*sen> <*>",
            is_subst: true,
            is_prop: true
      },
  ...
  ...
  ],
  tags: [
      "Andreas" ,
      "USA" ,
      "Thoresen" ,
      "Denmark" ,
      "Daae" ,
      "Skodjestrømmen" ,
      ...
      "Aasen" ,
      "Sweden"
  ]
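
The per-token flags in the obt list can be used to filter words on the client side. As a small sketch, assuming the obt list has been parsed into Python dictionaries with the fields shown above, the hypothetical helper proper_nouns keeps the words flagged as both is_subst and is_prop:

```python
def proper_nouns(obt_tokens):
    """Return the words flagged as proper nouns (is_subst and is_prop both true)."""
    return [
        token["word"]
        for token in obt_tokens
        if token.get("is_subst") and token.get("is_prop")
    ]


if __name__ == "__main__":
    tokens = [
        {"word": "Ivar", "is_verb": False, "is_subst": True, "is_prop": True},
        {"word": "Aasen", "is_verb": False, "is_subst": True, "is_prop": True},
    ]
    print(proper_nouns(tokens))
```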

Endpoint: /entities/<entity_name>

Method: GET

Description: It will extract information about the entity from DBpedia

Result properties

  • properties_uri: Properties uri of a given entity
  • types_uri: Types uri of a given entity
  • uri: uri of a given entity
  • redirected_from[optional]: The original entity uri if the entity name has been redirected
  • name: entity name

Example

  data: {
    "properties_uri": "http://<your-app-domain>/entities/Norway/properties",
    "types_uri": "http://<your-app-domain>/entities/Norway/types",
    "uri": "http://<your-app-domain>/entities/Norway",
    "redirected_from": "http://<your-app-domain>/entities/Norway",
    "name": "Norway"
   }

Endpoint: /entities/<entity_name>/types

Method: GET

Description: It will extract the types connected to the entity and try to guess the entity type (person, organization, event, location..)

Result properties

  • entity_detection: It will contain a guess about the type of the entity
  • types: List of types connected to the entity

Example

  data: {
     entity_detection: {
       is_person: false,
       is_location: true,
       is_event: false,
       other: false,
       is_org: false,
       type: "Location",
       is_work: false
     },
     types: [
       "http://www.w3.org/2002/07/owl#Thing",
       "http://schema.org/Country",
       "http://schema.org/Place",
       ...
     ]
  },
  name: Norway,
  entity_uri: http://<your-app-domain>/entities/Norway,
  uri: http://<your-app-domain>/entities/Norway/types
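
A client could read the guess out of entity_detection roughly as follows. This sketch assumes the response has been parsed into a dictionary with the layout of the example above; detected_type is a hypothetical helper name.

```python
def detected_type(types_payload):
    """Return the guessed type and the names of the is_* flags that are true."""
    detection = types_payload["data"]["entity_detection"]
    true_flags = sorted(
        key
        for key, value in detection.items()
        if key.startswith("is_") and value is True
    )
    return detection.get("type"), true_flags


if __name__ == "__main__":
    payload = {
        "data": {
            "entity_detection": {
                "is_person": False,
                "is_location": True,
                "is_event": False,
                "other": False,
                "is_org": False,
                "type": "Location",
                "is_work": False,
            },
            "types": ["http://schema.org/Country", "http://schema.org/Place"],
        },
        "name": "Norway",
    }
    print(detected_type(payload))
```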

Endpoint: /entities/<entity_name>/properties

Method: GET

Description: It will extract all the properties connected to a given entity

Example

  data: {
       "http://www.w3.org/2000/01/rdf-schema#label": {
            "uri": "http://<your-app-domain>/entities/Norway/properties?name=http%3A//www.w3.org/2000/01/rdf-schema%23label",
            "name": "http://www.w3.org/2000/01/rdf-schema#label"
       },
       "http://www.w3.org/2007/05/powder-s#describedby": {
            "uri": "http://<your-app-domain>/entities/Norway/properties?name=http%3A//www.w3.org/2007/05/powder-s%23describedby",
            "name": "http://www.w3.org/2007/05/powder-s#describedby",
       },
       ...
  }
  ,
  name: Norway,
  entity_uri: http://<your-app-domain>/entities/Norway,
  uri: http://<your-app-domain>/entities/Norway/properties

QueryString

  • name: It will extract the value for this given property.
  • lang: Country code of the language in which to return the property value. Only used in combination with name

Example

  uri: http://<your-app-domain>/entities/Norway/properties?name=http%3A//www.w3.org/2000/01/rdf-schema%23label

  data: {
       "http://www.w3.org/2000/01/rdf-schema#label": [
          "Norway",
          "\u0627\u0644\u0646\u0631\u0648\u064a\u062c",  // النرويج
          "Norwegen",
          "Noruega",
          "Norv\u00e8ge", // Norvège
          "Norvegia",
          "\u30ce\u30eb\u30a6\u30a7\u30fc", // ノルウェー
          "Noorwegen",
          "Norwegia",
          "Noruega",
          "\u041d\u043e\u0440\u0432\u0435\u0433\u0438\u044f", // Норвегия
          "\u632a\u5a01" // 挪威
       ]
  },
   "entity_uri": "http://<your-app-domain>/entities/Norway",
   "name": "Norway",
   "uri": "http://<your-app-domain>/entities/Norway/properties?name=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema#label"
  }
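
Property URIs contain characters such as ":", "/" and "#", so they need to be percent-encoded before being passed in the name parameter. The sketch below builds such a URL with the standard library; property_url is a hypothetical helper, and it drops lang when name is missing because lang is only used together with name.

```python
from urllib.parse import quote, urlencode


def property_url(base_url, entity_name, name=None, lang=None):
    """Build the /entities/<entity_name>/properties URL with an encoded query string."""
    url = "%s/entities/%s/properties" % (base_url.rstrip("/"), quote(entity_name))
    params = {}
    if name is not None:
        params["name"] = name
        # lang only makes sense in combination with name.
        if lang is not None:
            params["lang"] = lang
    if params:
        url += "?" + urlencode(params)
    return url


if __name__ == "__main__":
    # "http://localhost:8080" is a placeholder; substitute your own app domain.
    print(property_url(
        "http://localhost:8080",
        "Norway",
        name="http://www.w3.org/2000/01/rdf-schema#label",
        lang="en",
    ))
```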

Example Text

  Ivar Aasen ble født på gården Åsen i Hovdebygda på Sunnmøre som sønn av småbrukeren Ivar Jonsson.
  Han ble døpt Iver Andreas, formen «Ivar» kom i bruk omkring 1845. Gården han vokste opp på var isolert, så han hadde ingen kamerater.
  Dette førte til at han leste mye i de få bøkene familien hadde, deriblant Bibelen. Faren døde i 1826. Det var åtte søsken, og de mistet begge foreldrene tidlig.
  I foreldrenes fravær ble broren det nye familieoverhodet; han satte Ivar til gårdsarbeid og lot ham ikke utvikle de intellektuelle evnene sine, men Ivar utmerket seg likevel ved konfirmasjonen, og presten skrev rosende om ham i kirkeboken.
  Gården Ekset med Sivert og Rasmus Aarflots boksamling var bare 3 kilometer frå Åsen-garden. Aarflot hadde selv gjort observasjoner om slektskap mellom sunnmørsdialekten og gammelnorsk, og dette kan ha inspirert den unge Aasen.
  Aasen lærte seg norrønt, engelsk, fransk og latin.

Dependencies

  • The Oslo-Bergen tagger: morphosyntactic tagger for Norwegian bokmål and nynorsk. More info about the tagger. This is the output from the tagger.
  • OBT-Stat: Statistical disambiguator for the Oslo-Bergen Part of Speech tagger
  • VISL CG-3: CG compiler. 3rd version of the CG formalism variant
  • Multitagger: Multitagger with lexicon for Norwegian Bokmål and Nynorsk.
  • HunPos: an open source reimplementation of TnT, the well-known part-of-speech tagger

About Oslo-Bergen Tagger

The Oslo-Bergen Tagger is a morphosyntactic tagger for Norwegian bokmål and nynorsk. For general information about the tagger, visit its home page: Tekstlab.uio.no.

License

License GPLv3

Author

Domenico Solazzo - Twitter

jroc's Issues

Similarity search

JRoc should have the possibility to cluster similar tags together.
What is similar?

In v1: the similarity is based on Levenshtein distance
In v2: the similarity is going to be based on similar concepts.
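
To make the v1 idea concrete, here is one possible sketch of Levenshtein-based grouping. The greedy clustering strategy, the function names and the max_distance threshold of 2 are assumptions chosen for illustration; the issue does not specify how clusters should be formed.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def cluster_tags(tags, max_distance=2):
    """Greedily group tags that are close to the first tag of an existing cluster.

    max_distance is an arbitrary threshold picked for this example.
    """
    clusters = []
    for tag in tags:
        for cluster in clusters:
            if levenshtein(tag.lower(), cluster[0].lower()) <= max_distance:
                cluster.append(tag)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([tag])
    return clusters


if __name__ == "__main__":
    print(cluster_tags(["Aasen", "Aasens", "USA", "Sweden"]))
```

Comparing each tag only against the first member of a cluster keeps the sketch simple, but it makes the result depend on input order.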

Error using HunPos on Heroku

Error when using the tagger with statistical disambiguation on Heroku.

Error log:

   sh: 1: /app/The-Oslo-Bergen-Tagger/OBT-Stat/hunpos/hunpos-1.0-linux/hunpos-tag: not found
   /app/The-Oslo-Bergen-Tagger/OBT-Stat/lib/disambiguation_context.rb:21:in `initialize': Inconsistent token count in OBT and Hunpos data. (ArgumentError)
   from /app/The-Oslo-Bergen-Tagger/OBT-Stat/lib/disambiguator.rb:153:in `disambiguate'
   from /app/The-Oslo-Bergen-Tagger/OBT-Stat/lib/disambiguator.rb:153:in `new'
   from /app/modules/tagger/../../The-Oslo-Bergen-Tagger/OBT-Stat/bin/run_obt_stat.rb:29:in `run_disambiguator'
   from /app/modules/tagger/../../The-Oslo-Bergen-Tagger/OBT-Stat/bin/run_obt_stat.rb:107:in `<main>'

Documentation: Architecture

Add new documentation about the new architecture for jroc
It should document:

  • The main lib
  • The tasks
  • Pipelines
    • In-Memory
    • Queue-based
  • Webservices
  • Workers
    • Queues

When retrieving an entity by /entities, the result changes based on the capitalization of the word

This is an example:
Uri: https://<your-domain>/entities/usa

It works correctly with:

  • https://<your-domain>/entities/Usa
  • https://<your-domain>/entities/USA

Result:

    {
         uri: "http://<your-domain>/entities/http://www.ontologyportal.org/WordNet#WN30Word-usa",
         data: {
               properties_uri: "http://<your-domain>/entities/http://www.ontologyportal.org/WordNet#WN30Word-usa/properties",
               types_uri: "http://jroc-t1.herokuapp.com/entities/http://www.ontologyportal.org/WordNet#WN30Word-usa/types",
               name: "http://www.ontologyportal.org/WordNet#WN30Word-usa",
               redirected_from: "http://<your-domain>/entities/usa"
         }
    }

Expected:

     {
        uri: "http://<your-domain>/entities/United_States",
        data: {
              properties_uri: "http://<your-domain>/entities/United_States/properties",
              types_uri: "http://<your-domain>/entities/United_States/types",
              name: "United_States",
              redirected_from: "<your-domain>/entities/Usa"
        }
    }

Refactoring: Better package structure

It needs a better package structure.

  • Language detection
  • Tokenization
  • POS Tagger
  • Constituency Parser
  • NERC
    • NER
    • Regex NER
    • Classifier
  • NED
  • Coreference Resolution
  • Polarity Tagging
  • Opinion Detection
  • API

Property tagger

Add property tagger.
It detects aspect words and links them to the correct aspect class.
Useful for hotel reviews.

Example

Word found: cleanliness -> bed

Retrieving wrong tags

We should modify both entity recognition and tag recognition.

An example text is:

{"data": "VG henta i fjor ut tal frå inspeksjonane til Statens vegvesen, som viste at 104 bruer i Noreg er svekka. Og brusjefen i Vegdirektoratet vedgår at etterslepet framleis er stort. – Vi har eit relativt stort vedlikehaldsetterslep som vi arbeider med å redusere. Så det er ein stor innsats på gang no for å redusere dette etterslepet, seier han til NRK. Stengde bru. Måndag hastestengde vegvesenet Rauma bru på E136. Dykkarar oppdaga ein usikker brupilar. Dette er ei av landets bruer som sårt trengde oppgradering, og dette arbeidet hadde halde på i lang tid, før brua blei heilt stengt måndag. Det er slik vegvesenet skal jobbe, for å vareta tryggleiken, seier Stensvold. – Dersom vi oppdagar noko vi er usikker på, så må vi stenge til vi har funne ut kor alvorleg det er. Tryggleiken kjem først. Og Stensvold meiner dei har god kontroll på situasjonen, trass det store etterslepet. – Vi har jamlege inspeksjonar, der eventuelle feil ved konstruksjonen vil bli avdekka. Problematisk. Men stenginga av vegen har ikkje vore uproblematisk midt i turistsesongen. Ein av dei som ofte køyrer strekninga, Lars Hardeland frå Nettbuss, ristar oppgitt på hovudet over situasjonen. – Her har dei halde på med oppgraderingsarbeid og lysregulering i to år, og så oppdagar dei at brua er så dårleg at dei stenger den. Eg synest det er heilt ufatteleg, seier han. Lastebileigarforbundet er ei anna gruppe som er råka av vegstenginga, og distriktssjef Dagrunn Krakeli meiner at beredskapen ved brustengingar må bli betre. – Det har skjedd før, og det vil skje igjen, så planen for omkøyringar eller reservebruer frå forsvaret må vere klar, seier ho."}

It recognizes the word Dykkarar as both a tag and an entity, whereas it should be ignored.

Add Redis support

Add Redis support for using jroc as a background worker in Heroku
