ADH - ENG LANGUAGE DATASET

About

This repository contains Dhopadhola and English sentences that can be used for Machine Translation. The text comes from several domains and was scraped from different sources online and in print media.

I did this as part of my submission for the AI4D Language Dataset Challenge Round 2. My submission was not selected, but I have decided to make the data open source for anyone to use, as that was my initial goal and that of the challenge.

NLP, Machine Translation, Africa, Uganda

About Dataset

This dataset was created to provide Dhopadhola (ADH) to English parallel sentences, to help make services that rely on Natural Language Processing available to Dhopadhola speakers.

The dataset can be used for Machine Translation purposes. It consists of 2484 parallel (Dhopadhola and English) sentences from different domains and 3386 monolingual Dhopadhola sentences. Both Supervised and Semi-supervised MT can utilise this dataset.
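As a rough illustration of how the parallel sentences could be prepared for supervised MT, here is a minimal Python sketch. It assumes the Dhopadhola and English sentences are available as two line-aligned lists (line i of one is the translation of line i of the other); the actual file layout in this repo may differ, so check the datasheet first. The function names `pair_sentences` and `split_pairs`, the split fractions, and the placeholder sentences are all hypothetical and not part of the dataset.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (Dhopadhola sentence, English sentence)


def pair_sentences(adh_lines: Sequence[str], en_lines: Sequence[str]) -> List[Pair]:
    """Zip two line-aligned lists into sentence pairs.

    Assumption: both lists have one sentence per line and the same length.
    Pairs where either side is empty after stripping whitespace are dropped.
    """
    if len(adh_lines) != len(en_lines):
        raise ValueError("the two sides must have the same number of lines")
    pairs = []
    for adh, en in zip(adh_lines, en_lines):
        adh, en = adh.strip(), en.strip()
        if adh and en:
            pairs.append((adh, en))
    return pairs


def split_pairs(
    pairs: Sequence[Pair],
    dev_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle with a fixed seed and split into train, dev and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder strings stand in for real sentences from the dataset.
    adh = [f"<adh sentence {i}>\n" for i in range(10)]
    en = [f"<en sentence {i}>\n" for i in range(10)]
    train, dev, test = split_pairs(pair_sentences(adh, en))
    print(len(train), len(dev), len(test))
```

The fixed seed keeps the split reproducible, which matters with a dataset this small because a different shuffle can change evaluation scores noticeably. The monolingual Dhopadhola sentences are not handled here.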

The dataset can also be used to study transfer learning in related African languages, as Dhopadhola is closely related to Dholuo (spoken in Kenya and Tanzania), to Acholi, Lango and Alur (spoken in Uganda), and to other Luo languages.

Dhopadhola is a very low-resourced language; it has very few resources available publicly on the internet or in print media. This dataset will help increase the availability of Dhopadhola in digital media: once the task it is intended for (Machine Translation) is implemented, more resources can be translated into the language, and native speakers will be incentivized to use it online, e.g. on social media, because non-speakers can get translations.

Dataset Composition

Get the most updated information from [the datasheet](./Clean Language Data/Ogayo_documentation_2.pdf)

Repo Structure

This repo contains 3 main folders of interest.

1. Clean language data

Contains all the text combined from different source files. Datasheets expounding on the data are also available.

2. Raw data

Contains sentences in their individual source files. Not that raw, as some cleaning has already been done. If you need the webpage or the document without any form of manipulation, let me know.

3. Notebooks

Jupyter Notebooks that I used to scrape and clean the data. They need some clean-up though.

Clone

  • Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Contributing

To get started...

Step 1

  • Option 1

    • ๐Ÿด Fork this repo!
  • Option 2

    • 👯 Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Step 2

  • HACK AWAY! 🔨🔨🔨

Step 3

  • 🔃 Create a new pull request

Team

Perez Ogayo

  • We are a small team. Join us and let's put Africa on the NLP Map together!

Support me

I am in the process of setting up a wallet. Feel free to reach out to me so that I can give you other payment details in the meantime.


License

This work is licensed under a Creative Commons Attribution 4.0 International License.
