cisnlp / glot500
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023
Home Page: https://aclanthology.org/2023.acl-long.61
License: Other
I've re-trained a Mistral (~ Llama) language-specific tokenizer on the training portion of the Yoruba samples and noticed strange tokens. For example:
{"Ìròyìn▁tó▁ṣe▁kókó▁Àbámọ̀▁ni▁yóò▁gbẹ̀yin▁ẹgbẹ́▁Association▁of▁Stingy▁Men▁tí▁kò▁fẹ́▁náwó▁fóbìnrin-▁Akeugbagoldwákàtí▁9▁sẹ́yìn▁Gbọ́,▁Ìṣẹ́jú▁kan▁BBC▁07:00▁UTCwákàtí▁kan▁sẹ́yìn▁Wo▁ohun▁tí▁a▁mọ̀▁nípa▁gbèdéke▁ti▁Sunday▁Igboho▁fún▁àwọn▁Fulani▁ní▁Ibarapa▁àti▁èsì▁tí▁wọ́n▁fún▁unwákàtí▁3▁sẹ́yìn▁Ìwádìí▁kíkún▁lóríi▁kókó▁ìròyìn▁Amad▁Diallo▁darapọ̀▁mọ́▁Manchester▁United8▁Sẹ́rẹ́▁2021▁Èsíò!": 29494}
This token occurs 1832 times in the training split (rg $STRING yoruba_textified.txt | wc -l), each time in ever so slightly different contexts (i.e., near duplicates).
I hence checked for duplicates in the dataset and found abundant hard duplicates: the 4.5M lines reduce to 1.16M unique lines. I understand that datasets for low-resource languages are noisy, but I presume users expect hard duplicates not to occur.
To reproduce:
from datasets import load_dataset
from collections import Counter
import numpy as np
import pandas as pd
dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")
counter = Counter(dataset["text"])
c = sorted(counter.items(), key=lambda counts: counts[1])
_, counts = zip(*c)
counts = np.array(counts)
print(
    pd.DataFrame(counts)
    .round(0)
    .describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
    .T
)
# original length is 4.5M lines
# count mean std min 10% 25% 50% 75% 90% 95% 99% max
# 1167327.0 3.84903 1.379304 2.0 2.0 2.0 5.0 5.0 5.0 5.0 5.0 10.0
I have briefly checked the train splits of some other languages, which also comprise duplicates to varying degrees (see the sketch after the numbers below):
kin_Latn: nearly correct
Original length: 415405
Deduplicated length: 401856
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 401856.0 1.033716 0.180498 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0
uzb_Latn: OK
Original length: 3182175
Deduplicated length: 3182175
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 3182175.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ibo_Latn: bad
Original length: 5608630
Deduplicated length: 1526812
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 1526812.0 3.673425 0.739383 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 20.0
wol_Latn: nearly correct
Original length: 92358
Deduplicated length: 92357
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 92357.0 1.000011 0.003291 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0
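For reference, the per-language numbers above can be reproduced with a small variant of the snippet from the Yoruba case; a minimal sketch, assuming the same Hugging Face configs as used there:
# Sketch: count hard duplicates per language config of cis-lmu/Glot500.
from collections import Counter
from datasets import load_dataset

def duplicate_stats(lang: str, split: str = "train") -> None:
    dataset = load_dataset("cis-lmu/Glot500", lang, split=split)
    counter = Counter(dataset["text"])
    print(f"{lang}:")
    print(f"Original length: {len(dataset)}")
    print(f"Deduplicated length: {len(counter)}")

for lang in ["kin_Latn", "uzb_Latn", "ibo_Latn", "wol_Latn"]:
    duplicate_stats(lang)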
Hi there!
Now that I have access to the full Glot500 and am starting my experiments, I realized that the publicly available Glot500 on HF does not seem to comprise the 41580525 sentences (or a very large subset thereof) for tel_Telu.
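A quick way to check the discrepancy (a sketch; the expected figure of 41580525 sentences is the one mentioned above):
from datasets import load_dataset

# Sketch: compare the HF train split size for tel_Telu against the expected 41580525 sentences.
dataset = load_dataset("cis-lmu/Glot500", "tel_Telu", split="train")
print(len(dataset))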
How to reproduce the NER evaluation?
Hello,
I came across your work through a virtual talk by Prof. Schütze and found it to be a valuable resource. I'm particularly interested in the Glot500-c(Glot500 corpus) data.
At the moment, your README mentions that access to the corpus will be given after filling an online form, and the form will be available soon. Is there any tentative date for the release of the form? The multilingual corpus would greatly assist my research group with our research on language models.
Thank you for maintaining such a helpful repository, your work is greatly appreciated 😊
Yesterday, the Taxi1500 corpus was added to Glot500 on GitHub. I think this is not meant to be the case.
See also https://huggingface.co/datasets/cis-lmu/Glot500/tree/main
tl;dr: some shards of the languages below (and potentially more) have the extra column "__index_level_0__". The dataset thus cannot be fully loaded.
Thanks for providing a potentially super cool dataset for multilingual NLP research!
While my request for access to the full Glot500-c is still awaiting processing, I thought I would try to use what's available on Hugging Face and quickly ran into the issue already documented here https://huggingface.co/datasets/cis-lmu/Glot500/discussions/3
While I am still loading the dataset, among the train splits of the first 139 languages, the arrow files of
afr_Latn
amh_Ethi
ara_Arab
en_Latn
fra_Latn
hau_Latn
mlg_Latn
nya_Latn
sna_Latn
som_Latn
sot_Latn
swa_Latn
zul_Latn
have inconsistent column names. That is, some shards have "__index_level_0__" as an extra column. The Python script below is slow but should eventually fix the problem.
# not mega pretty but gets the job done
from datasets import load_dataset, concatenate_datasets
from pathlib import Path
from datasets.exceptions import DatasetGenerationError

CWD = Path.cwd()  # inside Glot500 folder
BACKUP = Path("../Glot500_backup")
if not BACKUP.exists():
    BACKUP.mkdir()
SPLIT = "train"
langs = [p for p in CWD.glob("*") if p.is_dir() and "_" in str(p)]


def fix(lang: str, lang_split_dir: str, paths: list[Path]):
    datasets = []
    original_dir = BACKUP / lang / SPLIT
    if not original_dir.exists():
        original_dir.mkdir(parents=True)
    # move the original shards into the backup folder
    for path in paths:
        new_path = original_dir.joinpath(path.name)
        path.rename(new_path)
    # load each shard individually
    for path in original_dir.glob("*.arrow"):
        datasets.append(
            load_dataset("arrow", data_files={"train": str(path)}, split="train")
        )
    # drop the stray index column from shards that have it
    col = "__index_level_0__"
    datasets_ = []
    counter = 0
    for d in datasets:
        if col in d.features:
            d_ = d.remove_columns(col)
            counter += 1
        else:
            d_ = d
        datasets_.append(d_)
    print(f"Cleaned up {counter} shards for {SPLIT} of {lang}")
    dataset = concatenate_datasets(datasets_)
    dataset.save_to_disk(lang_split_dir)


datasets = {}
for i, lang in enumerate(langs):
    print(f"Processing {i}/{len(langs)}: {lang}")
    lang_train = lang / "train"
    lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
    try:
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
    except DatasetGenerationError:
        print(f"Fixing {lang}")
        fix(lang.stem, str(lang_train), list(lang_train.glob("*.arrow")))
        lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
Proposed fix: I suppose it would be relatively straightforward for you to run a variant of the above script and reupload the fully loadable dataset.
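Once reuploaded, a quick sanity check could look like the sketch below (my assumption being that every language config should then load cleanly with a train split):
# Sketch: verify that every config of the reuploaded dataset loads and has no stray index column.
from datasets import get_dataset_config_names, load_dataset

for config in get_dataset_config_names("cis-lmu/Glot500"):
    ds = load_dataset("cis-lmu/Glot500", config, split="train")
    assert "__index_level_0__" not in ds.features, config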
I would highly appreciate also getting full access to the dataset :)
Thanks a lot in advance!
Are there plans to train a large Glot500 model?
Thanks for your work so far!