OSCAR. This large corpus contains articles from many sources crawled by
CommonCrawl and extracted by ALMAnaCH. In total there are
4B word tokens and 2B word types. (NOTE: Contains strong language, mostly from gambling sites.)
Leipzig corpora collection. Indonesian mixed corpus
based on material from 2013. Sentences: 74,329,815 - Types: 7,964,109 - Tokens: 1,206,281,985. From news materials, randomly chosen websites, and Wikipedia dumps.
Indonesian Treebank. This corpus contains 1K parsed
sentences. (constituency parsing)
UD Indonesian. This corpus is
provided by Universal Dependencies. Training, development,
and test splits are already provided. (dependency parsing)
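Treebanks distributed by Universal Dependencies use the CoNLL-U format (ten tab-separated columns per token, blank lines between sentences). The following is a minimal reading sketch in plain Python, not an official loader. The `SAMPLE` sentence is hand-written for illustration and is not taken from the treebank, and `read_conllu` is a hypothetical helper name.

```python
# Minimal CoNLL-U reader (sketch). Column names follow the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sample, written by hand for illustration only.
SAMPLE = (
    "# text = Saya makan\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
)

sentences = read_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

All values are kept as strings; convert `id` and `head` to integers if a downstream parser needs numeric indices.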
Machine translation
PANL10N EN-ID news parallel corpus.
This corpus has sentences from news articles from several categories: economy (6K sentences),
international (6K sentences), science (6K sentences), and sport (4K sentences).
OPUS (Open Parallel Corpus). This site contains parallel corpora of Indonesian and other languages
based on openly available resources (e.g., OpenSubtitles).
IDENTICv1.0 [paper].
Indonesian (ID)-English (EN). 45k sentences/~1M tokens (ID). Domains: science, sport, international, economy, news articles, movie subtitles. It may overlap with the PANL10N corpus. The dataset is available as raw sentences, as tokenized sentences, and in CoNLL format.
IWSLT2017 [paper].
ID-EN. ~100k sentences. TED talk subtitles (spoken language).
NOTE: the test set tst2017-plus provided contains a small part of the train data (as mentioned here).
Asian Language Treebank [paper].
ID, EN, and some Asian languages (mostly Southeast Asian). 20k sentences. Domain: news.
Word normalization
Colloquial Indonesian Lexicon.
This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the paper.
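A lexicon of this kind is typically applied as a token-level lookup. The sketch below assumes the lexicon has already been loaded into a dictionary mapping colloquial tokens to lemmas. The three entries in `SAMPLE_LEXICON` are illustrative examples and are not copied from the lexicon file, and `normalize` is a hypothetical helper name.

```python
# Hypothetical entries for illustration; the real lexicon maps
# 3592 colloquial tokens onto 1742 lemmas.
SAMPLE_LEXICON = {
    "gak": "tidak",
    "udah": "sudah",
    "aja": "saja",
}


def normalize(tokens, lexicon):
    """Replace tokens found in the lexicon; leave all other tokens unchanged."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


normalized = normalize("aku gak tau".split(), SAMPLE_LEXICON)
print(normalized)
```

Tokens missing from the dictionary pass through untouched, so coverage depends entirely on the lexicon; the lookup is lowercased, which is an assumption about how the lexicon keys are stored.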
Text summarization
IndoSum.
A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources.
It has both abstractive summaries and extractive labels.
Text classification
SMS Spam.
This corpus contains 1143 sentences labeled as normal message, fraud, or promotion. It is provided by Yudi Wibisono.
Hate Speech Detection.
This dataset consists of 713 tweets in the Indonesian language with 453 non hate speech and 260 hate speech tweets.
Abusive Language Detection.
A collection of tweets for abusive language detection in Indonesian social media. It has two labelling schemes: a binary one (abusive / not abusive) and a three-way one (not abusive / abusive but not offensive / offensive). It also has its own colloquial Indonesian lexicon.
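The class counts listed for the Hate Speech Detection dataset above (453 non hate speech, 260 hate speech) imply a moderate imbalance. As a rough sketch, the arithmetic below derives the majority-class baseline accuracy and inverse-frequency class weights. The weighting formula is one common heuristic, not something the dataset authors prescribe, and the label names are placeholders.

```python
# Class counts as stated in the dataset description; label names are placeholders.
counts = {"non_hate_speech": 453, "hate_speech": 260}

total = sum(counts.values())

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights: total / (number of classes * class count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total tweets: {total}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
for label, w in weights.items():
    print(f"weight[{label}] = {w:.3f}")
```

Any classifier trained on this data should be compared against the majority baseline (about 0.635) rather than against 0.5.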
Speech recognition
TITML-IDN speech corpus.
The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances.
The utterances are phonetically balanced.
The corpus is free for academic/non-commercial use, but interested parties should make a formal request via email to the institution.
The procedure is listed here.
Indonesian Speech Recognition.
A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project, do not use it for any important tasks. The author is not responsible for the undesired results of using the data provided here.
CMU Wilderness Multilingual Speech Dataset.
A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations.
One of the languages is Indonesian. The utterances are readings from the Bible, recorded by bible.is.