malaysian-dataset's Introduction


Malaysian-Dataset: we gather Malaysian corpora!

This repository stores corpora for https://github.com/mesolitica/malaya

The speech datasets have moved to https://github.com/mesolitica/malaya-speech/tree/master/data

We will keep updating this repository over time.

How do we gather the datasets?

Social media

  1. We capture most live data from Twitter, Facebook and Instagram using crawlers, so we simply search the collected data using Elasticsearch queries.
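
As a rough illustration of what such a search could look like, here is a minimal sketch that builds an Elasticsearch Query DSL body as a plain Python dict. The field names `text` and `created_at`, the helper name `build_search_body` and the example keyword are assumptions for illustration only; the actual index layout used by the crawlers is not documented here.

```python
import json


def build_search_body(keyword, since, size=100):
    """Build a Query DSL body: full-text match plus a date filter.

    `text` and `created_at` are hypothetical field names.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match on the (assumed) post text field.
                "must": [{"match": {"text": keyword}}],
                # Only keep posts created on or after `since`.
                "filter": [{"range": {"created_at": {"gte": since}}}],
            }
        },
        # Newest posts first.
        "sort": [{"created_at": {"order": "desc"}}],
    }


body = build_search_body("nasi lemak", "2023-01-01")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the request body of a search call against whichever index holds the crawled posts.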

Translation

  1. We use Google Translate.
  2. We use ChatGPT.
  3. We use Malaya translation.

Semisupervised

Teacher-student

  1. Label a small sample by hand and train a base model on it.
  2. Use the trained base model to predict labels for a larger sample, then retrain the next student model on the high-confidence labelled data.
  3. Repeat.
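
The loop above can be sketched roughly as follows. This is only a toy example, not the models we actually train: it assumes 2D points, a nearest-centroid classifier, and a confidence threshold of 0.9, all chosen for illustration. The function names are hypothetical.

```python
import math


def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean point per label."""
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}


def predict_with_confidence(centroids, point):
    """Return (label, confidence) from a softmax over negative distances."""
    scores = {label: math.exp(-math.dist(point, c))
              for label, c in centroids.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def self_train(labelled, unlabelled, threshold=0.9, rounds=3):
    """Teacher-student loop: pseudo-label confident points, then retrain."""
    labelled = list(labelled)
    pool = list(unlabelled)
    model = train_centroids(labelled)  # step 1: base model
    for _ in range(rounds):
        confident, remaining = [], []
        for point in pool:  # step 2: predict on the larger sample
            label, conf = predict_with_confidence(model, point)
            if conf >= threshold:
                confident.append((point, label))
            else:
                remaining.append(point)
        if not confident:  # nothing new to learn from, stop early
            break
        labelled.extend(confident)
        pool = remaining
        model = train_centroids(labelled)  # retrain the student
    return model, labelled, pool


seed = [((0.0, 0.0), "neg"), ((10.0, 10.0), "pos")]
unlabelled = [(0.5, -0.5), (1.0, 0.2), (-0.3, 0.8),
              (9.5, 10.5), (10.2, 9.1), (11.0, 10.0),
              (5.0, 5.0)]  # the last point is ambiguous on purpose
model, labelled, pool = self_train(seed, unlabelled)
print(len(labelled), pool)
```

Points the model is unsure about stay in the unlabelled pool instead of being forced into a class, which is the reason for filtering on confidence.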

LLM

  1. Generate using ChatGPT.

Notes

  1. If mp.py is missing, get it at https://gist.github.com/huseinzol05/98974ae8c6c7a65d4bc0af9f5003786a
  2. If any Python scripts are missing, please contact me ASAP or create an issue.
  3. Please email us before distributing this data. Remember that we are giving away all this hard work for free.
  4. What you see is just the data; nobody sees how much it cost us to make it public.

Suggestion

  1. Feel free to contact me to request a new dataset.
  2. Feel free to open an issue if a dataset link returns forbidden; sometimes I forget to make it public.

Non-commercial Usage

Much of the data here was semisupervised, translated, tagged or decoded using third-party software, for example Google Translate and Google Speech. To avoid future complications, it is better not to use this data for commercial purposes; use for certain research purposes is allowed.

Acknowledgement

Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring the AWS, Google and private cloud resources used to deploy distributed crawlers.

Contributors

huseinzol05, aisyahrzk, ammar-azman, wanadzhar913, syafie-nzm, amzar96, hazqeel09, haizadtarik, carrotzrule123, hazmannaim, kamaruladha, atqnp, farhan-helmy, moiralah
