pioNER - named entity annotated datasets and GloVe models for the Armenian language

The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language.

Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and an encyclopedia.

Silver-standard dataset

The generated corpus is automatically extracted and annotated from Armenian Wikipedia. To create it, we used a modification of the approaches of Nothman et al. and of Sysoev and Andrianov, which use links between Wikipedia articles to extract fragments of named-entity annotated text.
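The link-based idea can be illustrated with a minimal sketch. It is not the pipeline used to build pioNER: the `article_types` mapping (article title to entity class), the whitespace tokenization, and the `B-`/`I-` tag prefixes are all assumptions made for the example, and the actual method is described in the paper.

```python
import re
from typing import Dict, List, Tuple

# Matches [[Target]] and [[Target|anchor text]] wiki links.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def annotate(wikitext: str, article_types: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn wiki links into (token, tag) pairs.

    article_types is a hypothetical mapping from article title to an
    entity class such as PER, ORG or LOC. Links to unknown articles and
    all plain text are tagged O. Tokens are split on whitespace only.
    """
    tagged: List[Tuple[str, str]] = []
    pos = 0
    for match in LINK.finditer(wikitext):
        # Plain text between the previous link and this one.
        tagged.extend((tok, "O") for tok in wikitext[pos:match.start()].split())
        target = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        label = article_types.get(target)
        for i, tok in enumerate(anchor.split()):
            if label is None:
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = match.end()
    # Plain text after the last link.
    tagged.extend((tok, "O") for tok in wikitext[pos:].split())
    return tagged


types = {"Hovhannes Tumanyan": "PER", "Tbilisi": "LOC"}
for token, tag in annotate("[[Hovhannes Tumanyan|Tumanyan]] lived in [[Tbilisi]] .", types):
    print(token, tag)
```

Mentions that are not linked stay tagged as O in this sketch, which is one reason automatically generated annotation of this kind is treated as silver-standard rather than gold-standard.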

The corpus is split into train and development sets.

Table 1. Statistics for pioNER train, development and test sets

dataset   #tokens   #sents   annotation   texts' source
train     130719    5964     automatic    Wikipedia
dev       32528     1491     automatic    Wikipedia
test      53606     2529     manual       iLur.am

Gold-standard dataset

This dataset is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size to the test sets of other languages (Table 2). We intend it to serve as a benchmark for future named entity recognition systems designed for the Armenian language.

The dataset contains annotations for 3 popular named entity classes: people (PER), organizations (ORG), and locations (LOC), and is released in CoNLL03 format with the IOB tagging scheme. During annotation, we generally relied on the categories and guidelines assembled by BBN Technologies for the TREC 2002 question answering track.

Tokens and sentences were segmented according to the UD standards for the Armenian language from the ArmTreebank project.
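As a rough illustration of working with this format, the sketch below reads CoNLL-style lines and extracts entity spans from IOB tags. The exact column layout of the released files is an assumption here (token in the first column, NER tag in the last, blank lines between sentences), and the sample text is invented for the example.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes the token is the first column and the NER tag is the last,
    with blank lines (or -DOCSTART- lines) separating sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_spans(sentence: Sentence) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, end exclusive, from IOB tags."""
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    for i, (_, tag) in enumerate(sentence):
        prefix, _, kind = tag.partition("-")
        # An open span ends at O, at any B- tag, or when the class changes.
        if label is not None and (prefix != "I" or kind != label):
            spans.append((label, start, i))
            label = None
        if prefix in ("B", "I") and label is None:
            label, start = kind, i
    if label is not None:
        spans.append((label, start, len(sentence)))
    return spans


sample = """Yerevan B-LOC
is O
large O
. O

John B-PER
Smith I-PER
spoke O
""".splitlines()

for sent in read_conll(sample):
    print([(label, [tok for tok, _ in sent[s:e]]) for label, s, e in iob_spans(sent)])
```

Because a span may open on either a `B-` or an `I-` tag, the same function accepts both common IOB variants, which is convenient when the variant used by a file is not known in advance.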

Table 2. Comparison of pioNER gold-standard test set with test sets for English, Russian, Spanish and German

test dataset              #tokens   #LOC   #ORG   #PER
Armenian pioNER           53606     1312   1338   1274
Russian factRuEval-2016   59382     1239   1595   1353
German CoNLL03            51943     1035   773    1195
Spanish CoNLL02           51533     1084   1400   735
English CoNLL03           46453     1668   1661   1671

GloVe embeddings

We also publish GloVe word vector models trained on Armenian texts containing 79 million tokens. The training set included articles from Armenian Wikipedia, the Armenian Soviet Encyclopedia, a subcorpus of the Eastern Armenian National Corpus, and news articles from over a dozen Armenian news websites and blogs. The texts cover topics such as economics, politics, weather forecasts, IT, law, and society, and come from both non-fiction and fiction genres.

Similar to the original embeddings published for the English language, we release 50-, 100-, 200- and 300-dimensional word vectors for Armenian with a vocabulary size of 400,000.
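A minimal sketch of loading such vectors follows. It assumes the models are distributed in the plain-text layout used by the original GloVe releases (one word per line followed by space-separated floats), which this README does not state; the three-dimensional toy vectors and the words in them are invented for the example.

```python
import io
import math
from typing import Dict, List, TextIO


def load_glove(handle: TextIO) -> Dict[str, List[float]]:
    """Read 'word v1 v2 ... vn' lines into a dict of word -> vector.

    Assumes the plain-text GloVe layout; lines without any vector
    components are skipped.
    """
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real vector file; real models have 50 to 300 dimensions.
toy_file = io.StringIO("city 1.0 0.0 0.0\ntown 0.8 0.6 0.0\nriver 0.0 0.0 1.0\n")
vectors = load_glove(toy_file)
print(cosine(vectors["city"], vectors["town"]))
print(cosine(vectors["city"], vectors["river"]))
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`, where `path` points to the downloaded model. Loading 400,000 words into Python lists is memory-hungry, so an array library may be preferable in practice.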

You can download GloVe models from here.

For more details, refer to the paper.
