masakhane-ner's Introduction

MasakhaNER: Named Entity Recognition for African Languages

This repository contains the code for training NER models for the two MasakhaNER projects:

  • MasakhaNER 1.0: NER dataset for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá). Annotation of the dataset was performed by volunteers from the Masakhane community, leveraging the participatory research design that has been shown to be successful for building machine translation models.

  • MasakhaNER 2.0: An expansion of MasakhaNER 1.0 to 20 African languages. The dataset includes all MasakhaNER 1.0 languages except Amharic, plus 11 new languages from West Africa (Bambara, Ewe, Fon, Mossi, and Twi), Central Africa (Ghomala), and Southern Africa (Chichewa, Setswana, chiShona, isiXhosa, and isiZulu). The project has been generously funded by the Lacuna Fund. More details about the project can be found here.

Required dependencies

  • python
    • transformers: state-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
    • seqeval: evaluation framework for sequence labeling.
    • ptvsd: remote debugging server for Python support in Visual Studio and Visual Studio Code.
pip install transformers seqeval ptvsd

License information

The code is based on the HuggingFace Transformers implementation (License: Apache 2.0).

The NER dataset is licensed under CC-BY-4.0-NC; the monolingual data have different licenses depending on the license of each news website.

Dataset information

Load dataset on HuggingFace

from datasets import load_dataset

# MasakhaNER 1.0 (10 languages), here Yorùbá
data = load_dataset('masakhaner', 'yor')
# MasakhaNER 2.0 (20 languages)
data = load_dataset('masakhane/masakhaner2', 'yor')
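
A minimal sketch of inspecting what comes back; the tokens and ner_tags column names below follow the masakhaner dataset card, but verify them against the version you load:

labels = data['train'].features['ner_tags'].feature.names   # ['O', 'B-PER', 'I-PER', ...]
example = data['train'][0]
# Pair each token with its human-readable BIO tag
print(list(zip(example['tokens'], [labels[i] for i in example['ner_tags']])))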

African NER model

We provide a single multilingual NER model covering all 20 African languages on the Hugging Face Model Hub:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0")
model = AutoModelForTokenClassification.from_pretrained("masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# A Nigerian-Pidgin example sentence
example = "Emir of Kano turban Zhang wey don spend 18 years for Nigeria"
ner_results = nlp(example)
print(ner_results)
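
The pipeline above returns one entry per subword token. Passing an aggregation strategy (a standard transformers pipeline option) merges the pieces into whole entity spans; a hedged sketch:

# Group subword predictions into complete entity spans
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in nlp_grouped(example):
    # Each dict carries the entity type (e.g. PER, LOC, DATE), the span text, and a confidence score
    print(entity["entity_group"], entity["word"], float(entity["score"]))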

Predict the best transfer language for zero-shot adaptation

If your language is not supported by our model, you can predict the best transfer language to adapt from, i.e., the one expected to give the best performance. This also supports non-African languages, because we trained the ranking model on both African and non-African languages (from Europe and Asia). More details can be found in the MasakhaNER2.0/ directory and in the paper.

To run the code, follow the instructions in LangRank (based on this paper) and install its requirements. Then run the code in ranking_languages/.

This is an example for Sesotho.

export LANG=sot
python3 langrank_predict.py -o ranking_data/datasets/ner-train.orig.$LANG -s ranking_data/datasets_spm/ner-train.orig.spm.$LANG -l $LANG -n 3 -t NER -m best

#1. ranking_data/datasets/ner_tsn : score=1.96
#	1. Entity overlap : score=1.55; 
#	2. GEOGRAPHIC : score=0.99; 
#	3. INVENTORY : score=0.66
#2. ranking_data/datasets/ner_swa : score=-0.19
#	1. INVENTORY : score=0.70; 
#	2. Transfer over target size ratio : score=0.51; 
#	3. GEOGRAPHIC : score=0.49
#3. ranking_data/datasets/ner_nya : score=-0.57
#	1. INVENTORY : score=0.85; 
#	2. GEOGRAPHIC : score=0.64; 
#	3. GENETIC : score=0.34
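
To consume this ranking programmatically, here is a small sketch that extracts the ranked transfer languages, assuming the printout above was saved to a file (the file name is hypothetical):

import re

with open("ranking_output.txt", encoding="utf-8") as f:   # hypothetical dump of the printout above
    report = f.read()
# Top-level lines look like "#1. ranking_data/datasets/ner_tsn : score=1.96"
ranking = re.findall(r"#\d+\.\s+\S*ner_(\w+)\s*:\s*score=(-?\d+\.?\d*)", report)
for lang, score in ranking:
    print(lang, float(score))   # tsn 1.96, swa -0.19, nya -0.57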

BibTeX entry and citation info

If you make use of the MasakhaNER 1.0 dataset, please cite our TACL paper. For MasakhaNER 2.0, please cite our EMNLP paper:

@article{10.1162/tacl_a_00416,
    author = {Adelani, David Ifeoluwa and Abbott, Jade and Neubig, Graham and D’souza, Daniel and Kreutzer, Julia and Lignos, Constantine and Palen-Michel, Chester and Buzaaba, Happy and Rijhwani, Shruti and Ruder, Sebastian and Mayhew, Stephen and Azime, Israel Abebe and Muhammad, Shamsuddeen H. and Emezue, Chris Chinenye and Nakatumba-Nabende, Joyce and Ogayo, Perez and Anuoluwapo, Aremu and Gitau, Catherine and Mbaye, Derguene and Alabi, Jesujoba and Yimam, Seid Muhie and Gwadabe, Tajuddeen Rabiu and Ezeani, Ignatius and Niyongabo, Rubungo Andre and Mukiibi, Jonathan and Otiende, Verrah and Orife, Iroro and David, Davis and Ngom, Samba and Adewumi, Tosin and Rayson, Paul and Adeyemi, Mofetoluwa and Muriuki, Gerald and Anebi, Emmanuel and Chukwuneke, Chiamaka and Odu, Nkiruka and Wairagala, Eric Peter and Oyerinde, Samuel and Siro, Clemencia and Bateesa, Tobius Saul and Oloyede, Temilola and Wambui, Yvonne and Akinode, Victor and Nabagereka, Deborah and Katusiime, Maurice and Awokoya, Ayodele and MBOUP, Mouhamadane and Gebreyohannes, Dibora and Tilaye, Henok and Nwaike, Kelechi and Wolde, Degaga and Faye, Abdoulaye and Sibanda, Blessing and Ahia, Orevaoghene and Dossou, Bonaventure F. P. and Ogueji, Kelechi and DIOP, Thierno Ibrahima and Diallo, Abdoulaye and Akinfaderin, Adewale and Marengereke, Tendai and Osei, Salomey},
    title = "{MasakhaNER: Named Entity Recognition for African Languages}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {9},
    pages = {1116-1131},
    year = {2021},
    month = {10},
    abstract = "{We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00416},
    url = {https://doi.org/10.1162/tacl\_a\_00416},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00416/1966201/tacl\_a\_00416.pdf},
}


@inproceedings{adelani-etal-2022-masakhaner,
    title = "{M}asakha{NER} 2.0: {A}frica-centric Transfer Learning for Named Entity Recognition",
    author = "Adelani, David  and
      Neubig, Graham  and
      Ruder, Sebastian  and
      Rijhwani, Shruti  and
      Beukman, Michael  and
      Palen-Michel, Chester  and
      Lignos, Constantine  and
      Alabi, Jesujoba  and
      Muhammad, Shamsuddeen  and
      Nabende, Peter  and
      Dione, Cheikh M. Bamba  and
      Bukula, Andiswa  and
      Mabuya, Rooweither  and
      Dossou, Bonaventure F. P.  and
      Sibanda, Blessing  and
      Buzaaba, Happy  and
      Mukiibi, Jonathan  and
      Kalipe, Godson  and
      Mbaye, Derguene  and
      Taylor, Amelia  and
      Kabore, Fatoumata  and
      Emezue, Chris Chinenye  and
      Aremu, Anuoluwapo  and
      Ogayo, Perez  and
      Gitau, Catherine  and
      Munkoh-Buabeng, Edwin  and
      Memdjokam Koagne, Victoire  and
      Tapo, Allahsera Auguste  and
      Macucwa, Tebogo  and
      Marivate, Vukosi  and
      Elvis, Mboning Tchiaze  and
      Gwadabe, Tajuddeen  and
      Adewumi, Tosin  and
      Ahia, Orevaoghene  and
      Nakatumba-Nabende, Joyce  and
      Mokono, Neo Lerato  and
      Ezeani, Ignatius  and
      Chukwuneke, Chiamaka  and
      Oluwaseun Adeyemi, Mofetoluwa  and
      Hacheme, Gilles Quentin  and
      Abdulmumin, Idris  and
      Ogundepo, Odunayo  and
      Yousuf, Oreen  and
      Moteu, Tatiana  and
      Klakow, Dietrich",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.298",
    pages = "4488--4508",
    abstract = "African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets as well as the lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest to-date human-annotated NER dataset for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14{\%} over 20 languages as compared to using English.",
}

masakhane-ner's People

Contributors

dadelani, hugolpz, michaelroeder, neubig, sebastianruder, seyyaw, shmuhammadd


masakhane-ner's Issues

Word separator character in the Amharic dataset

Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses the character ፡ (U+1361, ETHIOPIC WORDSPACE) to separate words from each other. This character is not used in the dataset, which looks like a reasonable decision, since it should be possible to add it automatically when reading a document from file. However, in some situations there is a single word separator character within the dataset, e.g., at https://github.com/masakhane-io/masakhane-ner/blob/main/data/amh/test.txt#L83. This seems to be wrong, since it makes the dataset harder to process. Either the word separator should be present between all words, or it should be skipped completely and left to the consumer of the dataset to add in the correct places.

Proposed fix

Either add the word separator in all places where the Amharic language would put it, OR remove all word separators and expect the consumer of the dataset to add them while loading the dataset.
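
As a sketch of the second option, the stray separators could be stripped from the CoNLL files like this (፡ is U+1361 ETHIOPIC WORDSPACE; the data/amh/ paths follow this repository's layout):

# Drop token lines whose token column is just the stray word separator "፡" (U+1361),
# keeping the blank lines that separate sentences
for split in ("train.txt", "dev.txt", "test.txt"):
    path = f"data/amh/{split}"
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    cleaned = [line for line in lines if line.split()[:1] != ["፡"]]
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(cleaned)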

No dataset for Luo

Hi,

I found that there is no data for the Luo language in this repository, and it is not included on the Hugging Face page either. Could you also publish the data to make the dataset complete?

Many thanks!

Feature request: Fula Pular

Are any of these languages related to Pular? My neighbors only speak Pular, so it would be a game-changer to be able to converse with them.
Thanks for your consideration!

Faulty full stop character in the Amharic dataset

Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses punctuation characters that are not common in other languages. The two important characters for this issue are the word separator ፡ (U+1361) and the full stop ። (U+1362).

This is an excerpt of the dataset (dev.txt):

አምቦ B-LOC
ከዚህ O
በኋላ O
የቱሪዝም O
የባህል O
እና O
የፖለቲካ O
ማዕከል O
ትሆናለች O
፡፡ O

The last token should be a full stop, i.e., ።. However, in this example and in other sentences in the dataset, the last line comprises two word separators instead (2× ፡). I think that this is a mistake and should be fixed within the dataset.

Proposed fix

Replace ፡፡ with ። in all three files of the Amharic dataset.
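
A minimal sketch of that replacement across the three splits (። is U+1362 ETHIOPIC FULL STOP; the paths assume this repository's layout):

# Replace the doubled word separator "፡፡" with the proper full stop "።" (U+1362)
for split in ("train.txt", "dev.txt", "test.txt"):
    path = f"data/amh/{split}"
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("፡፡", "።"))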

Improve readme.md

Definition: what does this code do?
Install: required dependencies, install commands?
Run: demo run command.
Contribute: how to add a language?
License.

Thanks a lot for this project. 🙏🏼 African languages need more projects like this.

Repository license

The arXiv document states that the research is under the open CC-BY license.

Can the authors confirm that this code is open source under an open license?

I can then submit a PR with the GNU + CC-BY licenses.

Truncated results for XLM-R and mBERT

Hi,

It seems that some of the prediction files truncate long sentences. For example, here is a long sentence in Hausa in the test file:

but this sentence is truncated in the XLM-R results:

Here's a similar result for mBERT:

Maybe you need to increase the maximum sequence length in whatever software you're using so that it can handle whole sentences?
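
One way to confirm this is to count subword tokens per test sentence against the training-time limit. A sketch assuming an XLM-R tokenizer and a 164-token limit (both are assumptions; adjust to your setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
MAX_LENGTH = 164   # assumed training-time maximum sequence length

# Read the CoNLL test file into sentences (blank lines separate sentences)
sentences, current = [], []
with open("data/hau/test.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            current.append(line.split()[0])
        elif current:
            sentences.append(current)
            current = []
if current:
    sentences.append(current)

for i, tokens in enumerate(sentences):
    n = len(tokenizer(tokens, is_split_into_words=True)["input_ids"])
    if n > MAX_LENGTH:
        print(f"sentence {i}: {n} subword tokens, truncated at {MAX_LENGTH}")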
