pioNER - named entity annotated datasets and GloVe models for the Armenian language

The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language.

Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and an encyclopedia.

Silver-standard dataset

The generated corpus is automatically extracted and annotated using Armenian Wikipedia. We used a modification of the approaches of Nothman et al. and Sysoev and Andrianov to create this corpus. This approach uses links between Wikipedia articles to extract fragments of named-entity annotated text.
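The link-based idea can be illustrated with a minimal sketch. It is not the pipeline used to build pioNER: it assumes wiki markup with [[Target|anchor]] links, a hypothetical ARTICLE_TYPES mapping from article titles to entity classes (how articles are classified is not shown), plain whitespace tokenization, and tags that mark the first token of every entity with B-.

```python
import re

# Hypothetical mapping from article title to entity class. In a real pipeline
# this would come from classifying Wikipedia articles, which is not shown here.
ARTICLE_TYPES = {"Yerevan": "LOC", "Hovhannes Tumanyan": "PER"}

# Matches [[Target]] and [[Target|anchor text]] links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, article_types):
    """Return (token, tag) pairs, tagging link anchors whose target has a known class."""
    pairs = []
    pos = 0
    for match in LINK_RE.finditer(wikitext):
        # Plain text before the link is treated as outside any entity.
        pairs.extend((tok, "O") for tok in wikitext[pos:match.start()].split())
        target = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        label = article_types.get(target)
        for i, tok in enumerate(anchor.split()):
            if label is None:
                # Link to an article with no known entity class.
                pairs.append((tok, "O"))
            else:
                pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = match.end()
    # Remaining text after the last link.
    pairs.extend((tok, "O") for tok in wikitext[pos:].split())
    return pairs
```

For a toy input such as "[[Hovhannes Tumanyan]] visited [[Yerevan|the capital]] .", the anchor tokens inherit the class of the linked article and all other tokens are tagged O.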

The corpus is split into train and development sets.

Table 1. Statistics for pioNER train, development and test sets

dataset	#tokens	#sents	annotation	texts' source
train	130719	5964	automatic	Wikipedia
dev	32528	1491	automatic	Wikipedia
test	53606	2529	manual	iLur.am

Gold-standard dataset

This dataset is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size to the test sets of other languages (Table 2). We intend it to serve as a benchmark for future named entity recognition systems designed for the Armenian language.

The dataset contains annotations for three popular named entity classes: people (PER), organizations (ORG), and locations (LOC), and is released in the CoNLL03 format with the IOB tagging scheme. During annotation, we generally relied on the categories and guidelines assembled by BBN Technologies for the TREC 2002 question answering track.

Tokens and sentences were segmented according to the UD standards for the Armenian language from the ArmTreebank project.
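A minimal sketch of reading such a file is given below. The exact column layout is an assumption: one token per line with whitespace-separated columns, the token in the first column, the tag in the last, and a blank line between sentences. The function names read_conll and count_entities are illustrative, not part of this release.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs from CoNLL-style lines."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: token in the first column, NER tag in the last.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def count_entities(sentences):
    """Count entity spans per class for tags of the form O, B-X or I-X.

    A span starts at a B- tag, or at an I- tag whose class differs from the
    previous token's class, so both common IOB variants are handled.
    """
    counts = {}
    for sentence in sentences:
        prev_type = None
        for _, tag in sentence:
            if tag == "O":
                prev_type = None
                continue
            prefix, _, ent_type = tag.partition("-")
            if prefix == "B" or ent_type != prev_type:
                counts[ent_type] = counts.get(ent_type, 0) + 1
            prev_type = ent_type
    return counts
```

Passing an open file object to read_conll and the result to count_entities gives per-class span counts that can be compared against the figures in Table 2, provided the file matches the layout assumed above.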

Table 2. Comparison of pioNER gold-standard test set with test sets for English, Russian, Spanish and German

test dataset	#tokens	#LOC	#ORG	#PER
Armenian pioNER	53606	1312	1338	1274
Russian factRuEval-2016	59382	1239	1595	1353
German CoNLL03	51943	1035	773	1195
Spanish CoNLL02	51533	1084	1400	735
English CoNLL03	46453	1668	1661	1671

GloVe embeddings

We also publish GloVe word vector models trained on Armenian texts containing 79 million tokens. The training set included the articles of Armenian Wikipedia, The Armenian Soviet Encyclopedia, a subcorpus of the Eastern Armenian National Corpus, and news articles from over a dozen Armenian news websites and blogs. The texts covered topics such as economics, politics, weather forecasts, IT, law, and society, and came from both non-fiction and fiction genres.

Similar to the original embeddings published for the English language, we release 50-, 100-, 200- and 300-dimensional word vectors for Armenian with a vocabulary size of 400000.
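A minimal sketch of loading the vectors without extra dependencies follows. It assumes the files use the plain-text layout of the original GloVe release (one word per line, followed by space-separated vector components); this should be checked against the downloaded files. The function names load_glove and cosine are illustrative.

```python
import math


def load_glove(lines):
    """Parse GloVe-style text lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            # Skip empty or malformed lines.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

To use it, open one of the downloaded files with encoding="utf-8" and pass the file object to load_glove. With a vocabulary of 400000 words, the resulting dictionary of Python lists takes a noticeable amount of memory, so a NumPy array may be preferable for larger experiments.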

You can download GloVe models from here.

For more details, refer to the paper.
