AfricaNLP-Public-Datasets

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

Datasets per task (Randomly ordered)

Machine Translation

  • TANZIL: Translations of the Quran into 42 languages, including African languages such as Amharic, Hausa, Somali, and Swahili.

  • MENYO-20k: A Yorùbá-English multi-domain parallel text dataset.

  • FFR: A Fon-French parallel text dataset.

  • Hausa Corpus: A Hausa-English parallel text dataset.

  • CCAligned: A parallel text dataset for English paired with 137 languages, including 30 African languages.

  • ParaCrawl: A parallel text dataset for 41 languages, including Somali and Swahili.

  • WikiMatrix: A parallel text dataset for 85 languages, including Swahili, Malagasy, and Egyptian Arabic.

  • Ethiopian MT datasets: A parallel text dataset for English paired with 7 Ethiopian languages.

  • English-Luganda: An English-Luganda parallel text dataset.

  • French-Fon and French-Ewe: A parallel text dataset for French paired with Fon and Ewe.

  • Amharic-English: An Amharic-English parallel text dataset.

  • Tigrinya-English: A Tigrinya-English parallel text dataset (Free registration required).

  • Lingala-French: A Lingala-French parallel text dataset (Free registration required).

  • Congolese Swahili-French (Min, Small, Medium): Congolese Swahili-French parallel text datasets (Free registration required).

  • Swahili-French: A synthetic Swahili-French parallel text dataset (Free registration required).

  • English-Hausa (Min, Small): English-Hausa parallel text datasets (Free registration required).

  • English-Swahili: An English-Swahili parallel text dataset (Free registration required).

  • English-Swahili: An English-Swahili parallel text dataset of 229,312 pairs, split across two separate files (Free registration required).

  • English-Kanuri: An English-Kanuri parallel text dataset (Free registration required).

  • English-Akuapem Twi: An English-Akuapem Twi parallel text dataset.

  • FLORES-101: A parallel text dataset for 101 languages, including 20 African languages.

  • isiXhosa-English: An isiXhosa-English parallel text dataset.

  • Tatoeba: A parallel text dataset for 409 languages, including 28 African languages.

  • Gnome: A technical domain parallel text dataset for 197 languages, including 16 African languages.

  • Ubuntu: A technical domain parallel text dataset for 244 languages, including 24 African languages.

  • OPUS-100: A parallel text dataset for 100 languages, including 9 African languages.

  • TICO-19: A COVID-19 domain parallel text dataset for 37 languages, including 13 African languages.

  • Mozilla localization: A parallel text dataset for 197 languages, including 18 African languages.

Text Classification

Sentiment Analysis

  • TUNIZI: A Tunisian Arabizi sentiment analysis dataset.
  • NaijaSenti: A sentiment analysis dataset for Hausa, Igbo, Yoruba, and Nigerian Pidgin.

Text Summarization

  • Amharic Summarization: A dataset for Amharic abstractive text summarization.

  • XL-Sum: A dataset for multilingual abstractive text summarization for 44 languages, including 10 African languages.

Named Entity Recognition

  • MasakhaNER: A dataset for Named Entity Recognition of 10 African languages.

  • WikiANN: A dataset for Named Entity Recognition for 282 languages, including several African languages.

  • Yoruba GV NER: A Yoruba Named Entity Recognition dataset.

  • Hausa VOA NER: A Hausa Named Entity Recognition dataset.

Automatic Speech Recognition (ASR)

Speech Translation

Monolingual Data

  • Swahili Language Modeling: A Swahili dataset for language modeling and additional datasets for Swahili Syllabic Alphabet and Swahili Word Analogy.

  • OSCAR: A multilingual dataset for 166 languages, including Amharic, Somali, Yoruba, Egyptian Arabic, Malagasy, Swahili, and Afrikaans.

  • Luganda Agriculture data (Bukedde, Wikipedia): Monolingual datasets for Luganda in the agricultural domain, drawn from Bukedde and Wikipedia.

  • isiXhosa: A monolingual dataset for isiXhosa.

  • mC4: A multilingual dataset for 101 languages, including 13 African languages.

  • MOT v1.0: A multilingual dataset for 44 languages, including 11 African languages.

Phonetic Dictionary

  • ipa-dict: A phonetic dictionary for 23 languages, including Swahili.

  • za-lex: Lexical pronunciation datasets for 6 languages spoken in South Africa: Afrikaans, Southern Sotho, Xhosa, Zulu, SA English, and Tswana.

Chatbots (Conversational AI) Data

  • AfriWOZ1.0: A set of 6 African dialogue datasets, human-translated from MultiWOZ2.2, for training chatbots or conversational AI.

Other potential sources:

Contributions

This is a growing list of NLP datasets for African languages. If I have missed a publicly available dataset, please add it via a pull request, contact me on Twitter, or email me at [email protected].

Contributors

andrews2017, dadelani, israelabebe, tosingithub
